ToxicityGuard

Prevents your AI from generating or accepting harmful content across 7 toxicity categories. Essential for customer-facing applications where inappropriate content could damage your brand or harm users.

Why Use ToxicityGuard

The risk: LLMs can be manipulated into generating hate speech, violent content, or other harmful material. Even without manipulation, they may occasionally produce inappropriate responses.

Real incidents:

  • Chatbots generating racist content after user prompting
  • AI assistants providing instructions for self-harm
  • Customer service bots being tricked into making threats

ToxicityGuard catches these before they reach your users.

Categories

ToxicityGuard detects 7 categories of harmful content:

| Category   | What It Catches                          | Example                       |
|------------|------------------------------------------|-------------------------------|
| Hate       | Discrimination, slurs, identity attacks  | "I hate [group]..."           |
| Violence   | Threats, graphic violence, glorification | "I'm going to hurt..."        |
| Sexual     | Explicit sexual content, solicitation    | Adult content                 |
| SelfHarm   | Suicide, self-injury promotion           | "You should hurt yourself..." |
| Harassment | Bullying, targeted attacks, intimidation | "You're worthless..."         |
| Dangerous  | Dangerous activities, weapons, drugs     | "How to make explosives..."   |
| Illegal    | Fraud, theft, other crimes               | "How to steal a car..."       |

How It Works

ToxicityGuard uses fast, keyword-based detection with configurable thresholds (see the sketch after this list):

  1. Input is analyzed for toxic patterns across all categories
  2. Each category receives a score from 0.0 to 1.0
  3. If any score exceeds your threshold, the input is blocked
  4. You get detailed category-level scores for logging/analysis
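
A simplified sketch of that decision logic is below. It is illustrative only, not the library's implementation; the category names, scores, and reason format follow the examples elsewhere on this page.

Python:

# Illustrative only: mimics the described behavior, not oxideshield's internals.
def decide(scores: dict[str, float], threshold: float = 0.7) -> tuple[bool, str]:
    """Block the input if any category score exceeds the threshold."""
    for category, score in scores.items():
        if score > threshold:
            return False, f"Toxicity detected: {category} ({score:.2f})"
    return True, ""

passed, reason = decide({"Hate": 0.85, "Violence": 0.10}, threshold=0.7)
# passed == False, reason == "Toxicity detected: Hate (0.85)"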

Usage Examples

Basic Usage

Rust:

use oxide_guard::{Guard, ToxicityGuard};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7);  // Block if any category > 0.7

let result = guard.check("I hate everyone in that group");

if !result.passed {
    println!("Blocked: {}", result.reason);
    // Output: "Blocked: Toxicity detected: Hate (0.85)"
}

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
result = guard.check("I hate everyone in that group")

if not result.passed:
    print(f"Blocked: {result.reason}")
    # Output: "Blocked: Toxicity detected: Hate (0.85)"

Getting Category Scores

For logging, analytics, or custom handling:

Rust:

use oxide_guard::ToxicityGuard;

let guard = ToxicityGuard::new("toxicity");
let scores = guard.analyze("This is some user input");

for (category, score) in scores {
    println!("{}: {:.2}", category, score);
}
// Output:
// Hate: 0.12
// Violence: 0.05
// Sexual: 0.00
// SelfHarm: 0.00
// Harassment: 0.15
// Dangerous: 0.00
// Illegal: 0.00

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
scores = guard.analyze("This is some user input")

for category, score in scores.items():
    print(f"{category}: {score:.2f}")

Filtering Specific Categories

Only check categories relevant to your use case:

Rust:

use oxide_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7)
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Violence,
        ToxicityCategory::Harassment,
    ]);

// Only blocks Hate, Violence, and Harassment content
// Sexual, SelfHarm, Dangerous, Illegal are ignored
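
A possible Python counterpart is sketched below. The categories keyword and the string form of the category names are assumptions based on the categories option in the configuration table; the exact signature is not confirmed here.

Python:

from oxideshield import toxicity_guard

# Assumption: the `categories` keyword mirrors the configuration option of the
# same name, and categories are named as in the table above.
guard = toxicity_guard(
    threshold=0.7,
    categories=["Hate", "Violence", "Harassment"],
)

# Only Hate, Violence, and Harassment content is blocked;
# Sexual, SelfHarm, Dangerous, and Illegal are ignored.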

Real Attack Examples

Hate Speech Blocked

Input:  "Those [slur] people are all the same and should be..."
Result: BLOCKED
        Category: Hate
        Score: 0.92
        Action: Block

Violence Detected

Input:  "I'm going to find you and make you pay for what you did"
Result: BLOCKED
        Category: Violence
        Score: 0.78
        Action: Block

Self-Harm Content Blocked

Input:  "Life is meaningless, you should just end it all"
Result: BLOCKED
        Category: SelfHarm
        Score: 0.89
        Action: Block

Clean Input Allowed

Input:  "Can you help me write a professional email to my boss?"
Result: ALLOWED
        Highest Score: Harassment (0.08)
        Action: Allow
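
These cases can double as a quick smoke test. Exact scores may vary between versions, so only the pass/block outcome is asserted:

Python:

from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)

# Benign request passes.
assert guard.check("Can you help me write a professional email to my boss?").passed

# Threatening input is blocked (Violence scores above the 0.7 threshold).
assert not guard.check("I'm going to find you and make you pay for what you did").passed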

Configuration Options

| Option     | Type   | Default | Description                                   |
|------------|--------|---------|-----------------------------------------------|
| threshold  | float  | 0.7     | Score threshold (0.0-1.0) to trigger blocking |
| categories | list   | All     | Categories to check                           |
| action     | Action | Block   | What to do when toxicity is detected          |

Threshold Guidelines

| Threshold | Behavior                               | Use Case                   |
|-----------|----------------------------------------|----------------------------|
| 0.3       | Very strict, may have false positives  | Children's apps, education |
| 0.5       | Strict, catches most offensive content | Professional environments  |
| 0.7       | Balanced (default)                     | General consumer apps      |
| 0.9       | Lenient, only catches obvious toxicity | Adult platforms            |
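
For example, a children's education app would run with the strict 0.3 threshold, accepting more false positives in exchange for broader coverage:

Python:

from oxideshield import toxicity_guard

# Strict setting for a children's education app (per the guidelines above).
strict_guard = toxicity_guard(threshold=0.3)

result = strict_guard.check("some user input")
if not result.passed:
    print(f"Blocked: {result.reason}")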

Performance

| Metric     | Value               |
|------------|---------------------|
| Latency    | <10 ms              |
| Memory     | ~5 MB               |
| Throughput | 100,000+ checks/sec |

When to Use

Use ToxicityGuard when:

  • Your AI is customer-facing or public
  • You need content moderation for compliance
  • Brand safety is important
  • You're deploying in regulated industries

Consider skipping it when:

  • The tool is internal and used only by trusted users
  • The LLM provider already performs content filtering
  • Your use case explicitly allows mature content

Integration with Other Guards

ToxicityGuard works well with:

  • PatternGuard: catch jailbreak attempts before the toxicity check
  • PIIGuard: redact personal data and filter harmful content
  • MLClassifierGuard: layered defense with ML backup

from oxideshield import multi_layer_defense

defense = multi_layer_defense(
    enable_length=True,      # Enforce input length limits first
    enable_toxicity=True,    # Then check for toxic content
    toxicity_threshold=0.7,
    strategy="fail_fast"
)

Limitations

  • Context-dependent content: Discussions about violence (news, history) may trigger false positives
  • Coded language: Rapidly evolving slang and dog whistles may not be detected
  • Sarcasm: Ironic statements may be misclassified

For highest accuracy, combine with MLClassifierGuard for context-aware detection.