ToxicityGuard

Prevents your AI from generating or accepting harmful content across 7 toxicity categories. Essential for customer-facing applications where inappropriate content could damage your brand or harm users.

Why Use ToxicityGuard

The risk: LLMs can be manipulated into generating hate speech, violent content, or other harmful material. Even without manipulation, they may occasionally produce inappropriate responses.

Real incidents:

- Chatbots generating racist content after user prompting
- AI assistants providing instructions for self-harm
- Customer service bots being tricked into issuing threats

ToxicityGuard catches these before they reach your users.

Categories

ToxicityGuard detects 7 categories of harmful content:

| Category | What It Catches | Example |
|---|---|---|
| Hate | Discrimination, slurs, identity attacks | "I hate [group]..." |
| Violence | Threats, graphic violence, glorification | "I'm going to hurt..." |
| Sexual | Explicit sexual content, solicitation | Adult content |
| SelfHarm | Suicide, self-injury promotion | "You should hurt yourself..." |
| Harassment | Bullying, targeted attacks, intimidation | "You're worthless..." |
| Dangerous | Dangerous activities, weapons, drugs | "How to make explosives..." |
| Illegal | Fraud, theft, other crimes | "How to steal a car..." |

How It Works

ToxicityGuard uses fast, keyword-based detection with configurable thresholds:

  1. Input is analyzed for toxic patterns across all categories
  2. Each category receives a score from 0.0 to 1.0
  3. If any score exceeds your threshold, the input is blocked
  4. You get detailed category-level scores for logging/analysis
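The four steps above can be sketched in plain Python. This is an illustrative standalone sketch, not the library's actual implementation: the keyword lists, scoring rule, and threshold value are all invented for demonstration.

```python
# Hypothetical per-category keyword patterns (step 1). The real guard
# uses its own, far more extensive pattern set.
TOXIC_PATTERNS = {
    "Hate": ("i hate",),
    "Violence": ("going to hurt", "make you pay"),
    "Harassment": ("worthless",),
}

def analyze(text):
    """Steps 1-2: score every category from 0.0 to 1.0."""
    lowered = text.lower()
    return {
        cat: 1.0 if any(p in lowered for p in pats) else 0.0
        for cat, pats in TOXIC_PATTERNS.items()
    }

def check(text, threshold=0.5):
    """Steps 3-4: block if any category score exceeds the threshold,
    and return the per-category scores for logging."""
    scores = analyze(text)
    blocked = [c for c, s in scores.items() if s > threshold]
    return {"passed": not blocked, "blocked_categories": blocked, "scores": scores}

result = check("I hate everyone in that group")
# result["passed"] is False; "Hate" scored above the threshold
```

Because scoring is a simple pattern scan over the input, a check runs in a single pass with no model inference, which is what keeps latency sub-millisecond.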

Usage Examples

Basic Usage

Rust:

```rust
use oxideshield_guard::{Guard, ToxicityGuard};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold_preset("balanced");  // Use a balanced detection level

let result = guard.check("I hate everyone in that group");

if !result.passed {
    println!("Blocked: {}", result.reason);
    // Output: "Blocked: Toxicity detected: Hate (high)"
}
```

Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold_preset="balanced")
result = guard.check("I hate everyone in that group")

if not result.passed:
    print(f"Blocked: {result.reason}")
    # Output: "Blocked: Toxicity detected: Hate (high)"
```

Getting Category Scores

For logging, analytics, or custom handling:

Rust:

```rust
use oxideshield_guard::ToxicityGuard;

let guard = ToxicityGuard::new("toxicity");
let scores = guard.analyze("This is some user input");

for (category, level) in scores {
    println!("{}: {}", category, level);
}
// Output:
// Hate: low
// Violence: low
// Sexual: none
// SelfHarm: none
// Harassment: low
// Dangerous: none
// Illegal: none
```

Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold_preset="balanced")
scores = guard.analyze("This is some user input")

for category, level in scores.items():
    print(f"{category}: {level}")
```

Filtering Specific Categories

Only check categories relevant to your use case:

Rust:

```rust
use oxideshield_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold_preset("balanced")
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Violence,
        ToxicityCategory::Harassment,
    ]);

// Only blocks Hate, Violence, and Harassment content
// Sexual, SelfHarm, Dangerous, Illegal are ignored
```
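The filtering behavior itself is easy to reason about: scores are still computed, but only the enabled categories can trigger a block. A standalone Python sketch (not the library API; the function and category names here are illustrative):

```python
ALL_CATEGORIES = ("Hate", "Violence", "Sexual", "SelfHarm",
                  "Harassment", "Dangerous", "Illegal")

def check_filtered(scores, enabled=ALL_CATEGORIES, threshold=0.5):
    """Only scores for enabled categories can trigger a block;
    everything else is ignored even if it scores high."""
    blocked = [c for c in enabled if scores.get(c, 0.0) > threshold]
    return {"passed": not blocked, "blocked_categories": blocked}

scores = {"Sexual": 0.9, "Hate": 0.1}  # hypothetical analyzer output
result = check_filtered(scores, enabled=("Hate", "Violence", "Harassment"))
# Sexual scored 0.9 but is not in the enabled set, so the input passes
```

Restricting categories this way trades coverage for fewer false positives in domains where some categories are expected content (e.g. a medical app discussing self-harm resources).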

Real Attack Examples

Hate Speech Blocked

```text
Input:  "Those [slur] people are all the same and should be..."
Result: BLOCKED
        Category: Hate
        Level: high
        Action: Block
```

Violence Detected

```text
Input:  "I'm going to find you and make you pay for what you did"
Result: BLOCKED
        Category: Violence
        Level: high
        Action: Block
```

Self-Harm Content Blocked

```text
Input:  "Life is meaningless, you should just end it all"
Result: BLOCKED
        Category: SelfHarm
        Level: high
        Action: Block
```

Clean Input Allowed

```text
Input:  "Can you help me write a professional email to my boss?"
Result: ALLOWED
        Highest Category: Harassment (low)
        Action: Allow
```

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| threshold_preset | string | "balanced" | Detection sensitivity preset (see guidelines below) |
| categories | list | All | Categories to check |
| action | Action | Block | What to do when toxicity is detected |

Threshold Guidelines

| Preset | Behavior | Use Case |
|---|---|---|
| "strict" | Very sensitive; may produce false positives | Children's apps, education |
| "firm" | Catches most offensive content | Professional environments |
| "balanced" | Balanced sensitivity (default) | General consumer apps |
| "lenient" | Only catches clearly toxic content | Adult platforms |
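One way to picture the presets is as progressively higher numeric cutoffs on the 0.0-1.0 category scores. The mapping below is hypothetical; the library's actual per-preset values are internal and may differ.

```python
# Hypothetical numeric thresholds behind each preset, for intuition only.
PRESET_THRESHOLDS = {
    "strict":   0.25,  # blocks even mildly suspicious input
    "firm":     0.40,
    "balanced": 0.55,  # the default
    "lenient":  0.80,  # blocks only clearly toxic input
}

def is_blocked(score, preset="balanced"):
    """Block when a category score exceeds the preset's cutoff."""
    return score > PRESET_THRESHOLDS[preset]

# A borderline score of 0.5 is blocked under "strict" and "firm"
# but allowed under "balanced" and "lenient".
```

The practical takeaway: tightening the preset raises recall (fewer toxic inputs slip through) at the cost of precision (more benign inputs blocked), so pick the preset by which error is cheaper for your audience.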

Performance

| Metric | Value |
|---|---|
| Latency | Sub-millisecond per check |
| Memory | Minimal footprint |
| Throughput | Designed for high-volume production workloads |

When to Use

Use ToxicityGuard when:

- Your AI is customer-facing or public
- You need content moderation for compliance
- Brand safety is important
- You're deploying in regulated industries

Consider skipping when:

- Internal tools serve trusted users only
- The LLM provider already performs content filtering
- Your use case explicitly allows mature content

Integration with Other Guards

ToxicityGuard works well with:

- PatternGuard - Catch jailbreak attempts before the toxicity check
- PIIGuard - Redact personal data AND filter harmful content
- MLClassifierGuard - Layered defense with ML backup

```python
from oxideshield import multi_layer_defense

defense = multi_layer_defense(
    enable_length=True,      # Enforce input length limits first
    enable_toxicity=True,    # Then check for toxic content
    toxicity_preset="balanced",
    strategy="fail_fast"
)
```

Limitations

  • Context-dependent content: Discussions about violence (news, history) may trigger false positives
  • Coded language: Rapidly evolving slang and dogwhistles may not be detected
  • Sarcasm: Ironic statements may be misclassified

For highest accuracy, combine with MLClassifierGuard for context-aware detection.
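The layered approach can be sketched as a two-stage check: a fast keyword pass handles the obvious cases, and only inputs that survive it reach a slower, context-aware classifier. Everything below is an illustrative standalone sketch; `ml_score` is a stub standing in for a real model, not the MLClassifierGuard API.

```python
def keyword_score(text):
    # Fast first pass: crude keyword lookup (illustrative only).
    return 1.0 if "i hate" in text.lower() else 0.0

def ml_score(text):
    # Stand-in for a context-aware classifier; a real model would
    # weigh context (news reporting, history, sarcasm) before scoring.
    return 0.9 if "hurt yourself" in text.lower() else 0.1

def layered_check(text, threshold=0.5):
    """Return True if the input is allowed, False if blocked."""
    # Block immediately on a clear keyword hit; otherwise fall back
    # to the slower, context-aware model.
    if keyword_score(text) > threshold:
        return False
    return ml_score(text) <= threshold

layered_check("I hate everyone")                # blocked by the fast pass
layered_check("you should hurt yourself")       # blocked by the ML fallback
layered_check("a news article about violence")  # allowed
```

This ordering keeps average latency close to the keyword pass alone, since the expensive model only runs on inputs the fast pass could not decide.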