ToxicityGuard

Prevents your AI from generating or accepting harmful content across 7 toxicity categories. Essential for customer-facing applications where inappropriate content could damage your brand or harm users.

Why Use ToxicityGuard

The risk: LLMs can be manipulated into generating hate speech, violent content, or other harmful material. Even without manipulation, they may occasionally produce inappropriate responses.

Real incidents:

  • Chatbots generating racist content after user prompting
  • AI assistants providing instructions for self-harm
  • Customer service bots being tricked into making threats

ToxicityGuard catches these before they reach your users.

Categories

ToxicityGuard detects 7 categories of harmful content:

| Category   | What It Catches                          | Example                       |
|------------|------------------------------------------|-------------------------------|
| Hate       | Discrimination, slurs, identity attacks  | "I hate [group]..."           |
| Violence   | Threats, graphic violence, glorification | "I'm going to hurt..."        |
| Sexual     | Explicit sexual content, solicitation    | Adult content                 |
| SelfHarm   | Suicide, self-injury promotion           | "You should hurt yourself..." |
| Harassment | Bullying, targeted attacks, intimidation | "You're worthless..."         |
| Dangerous  | Dangerous activities, weapons, drugs     | "How to make explosives..."   |
| Illegal    | Fraud, theft, other crimes               | "How to steal a car..."       |

How It Works

ToxicityGuard uses fast, keyword-based detection with configurable thresholds (see the sketch after this list):

  1. Input is analyzed for toxic patterns across all categories
  2. Each category receives a score from 0.0 to 1.0
  3. If any score exceeds your threshold, the input is blocked
  4. You get detailed category-level scores for logging/analysis
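
A simplified sketch of that decision logic is below. It is illustrative only, not the library's implementation; the category names, scores, and reason format follow the examples elsewhere on this page.

Python:

# Illustrative only: mimics the described behavior, not oxideshield's internals.
def decide(scores: dict[str, float], threshold: float = 0.7) -> tuple[bool, str]:
    """Block the input if any category score exceeds the threshold."""
    for category, score in scores.items():
        if score > threshold:
            return False, f"Toxicity detected: {category} ({score:.2f})"
    return True, ""

passed, reason = decide({"Hate": 0.85, "Violence": 0.10}, threshold=0.7)
# passed == False, reason == "Toxicity detected: Hate (0.85)"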

Usage Examples

Basic Usage

Rust:

use oxide_guard::{Guard, ToxicityGuard};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7);  // Block if any category > 0.7

let result = guard.check("I hate everyone in that group");

if !result.passed {
    println!("Blocked: {}", result.reason);
    // Output: "Blocked: Toxicity detected: Hate (0.85)"
}

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
result = guard.check("I hate everyone in that group")

if not result.passed:
    print(f"Blocked: {result.reason}")
    # Output: "Blocked: Toxicity detected: Hate (0.85)"

Getting Category Scores

For logging, analytics, or custom handling:

Rust:

use oxide_guard::ToxicityGuard;

let guard = ToxicityGuard::new("toxicity");
let scores = guard.analyze("This is some user input");

for (category, score) in scores {
    println!("{}: {:.2}", category, score);
}
// Output:
// Hate: 0.12
// Violence: 0.05
// Sexual: 0.00
// SelfHarm: 0.00
// Harassment: 0.15
// Dangerous: 0.00
// Illegal: 0.00

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
scores = guard.analyze("This is some user input")

for category, score in scores.items():
    print(f"{category}: {score:.2f}")

Filtering Specific Categories

Only check categories relevant to your use case:

Rust:

use oxide_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7)
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Violence,
        ToxicityCategory::Harassment,
    ]);

// Only blocks Hate, Violence, and Harassment content
// Sexual, SelfHarm, Dangerous, Illegal are ignored
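
A possible Python counterpart is sketched below. The categories keyword and the string form of the category names are assumptions based on the categories option in the configuration table; the exact signature is not confirmed here.

Python:

from oxideshield import toxicity_guard

# Assumption: the `categories` keyword mirrors the configuration option of the
# same name, and categories are named as in the table above.
guard = toxicity_guard(
    threshold=0.7,
    categories=["Hate", "Violence", "Harassment"],
)

# Only Hate, Violence, and Harassment content is blocked;
# Sexual, SelfHarm, Dangerous, and Illegal are ignored.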

Real Attack Examples

Hate Speech Blocked

Input:  "Those [slur] people are all the same and should be..."
Result: BLOCKED
        Category: Hate
        Score: 0.92
        Action: Block

Violence Detected

Input:  "I'm going to find you and make you pay for what you did"
Result: BLOCKED
        Category: Violence
        Score: 0.78
        Action: Block

Self-Harm Content Blocked

Input:  "Life is meaningless, you should just end it all"
Result: BLOCKED
        Category: SelfHarm
        Score: 0.89
        Action: Block

Clean Input Allowed

Input:  "Can you help me write a professional email to my boss?"
Result: ALLOWED
        Highest Score: Harassment (0.08)
        Action: Allow
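
These cases can double as a quick smoke test. Exact scores may vary between versions, so only the pass/block outcome is asserted:

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)

# Benign request passes.
assert guard.check("Can you help me write a professional email to my boss?").passed

# Threatening input is blocked (Violence scores above the 0.7 threshold).
assert not guard.check("I'm going to find you and make you pay for what you did").passed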

Configuration Options

| Option     | Type   | Default | Description                                   |
|------------|--------|---------|-----------------------------------------------|
| threshold  | float  | 0.7     | Score threshold (0.0-1.0) to trigger blocking |
| categories | list   | All     | Categories to check                           |
| action     | Action | Block   | What to do when toxicity is detected          |

Threshold Guidelines

| Threshold | Behavior                               | Use Case                   |
|-----------|----------------------------------------|----------------------------|
| 0.3       | Very strict, may have false positives  | Children's apps, education |
| 0.5       | Strict, catches most offensive content | Professional environments  |
| 0.7       | Balanced (default)                     | General consumer apps      |
| 0.9       | Lenient, only catches obvious toxicity | Adult platforms            |
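
For example, a children's education app would run with the strict 0.3 threshold, accepting more false positives in exchange for broader coverage:

Python:

from oxideshield import toxicity_guard

# Strict setting for a children's education app (per the guidelines above).
strict_guard = toxicity_guard(threshold=0.3)

result = strict_guard.check("some user input")
if not result.passed:
    print(f"Blocked: {result.reason}")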

Performance

| Metric     | Value               |
|------------|---------------------|
| Latency    | <10 ms              |
| Memory     | ~5 MB               |
| Throughput | 100,000+ checks/sec |

When to Use

Use ToxicityGuard when:

  • Your AI is customer-facing or public
  • You need content moderation for compliance
  • Brand safety is important
  • You're deploying in regulated industries

Consider skipping it when:

  • The tool is internal and used only by trusted users
  • The LLM provider already performs content filtering
  • Your use case explicitly allows mature content

Integration with Other Guards

ToxicityGuard works well with:

  • PatternGuard: catch jailbreak attempts before the toxicity check
  • PIIGuard: redact personal data and filter harmful content
  • MLClassifierGuard: layered defense with ML backup

from oxideshield import multi_layer_defense

defense = multi_layer_defense(
    enable_length=True,      # Enforce input length limits first
    enable_toxicity=True,    # Then check for toxic content
    toxicity_threshold=0.7,
    strategy="fail_fast"
)

Limitations

  • Context-dependent content: Discussions about violence (news, history) may trigger false positives
  • Coded language: Rapidly evolving slang and dog whistles may not be detected
  • Sarcasm: Ironic statements may be misclassified

For highest accuracy, combine with MLClassifierGuard for context-aware detection.