ToxicityGuard¶
Prevents your AI from generating or accepting harmful content across 7 toxicity categories. Essential for customer-facing applications where inappropriate content could damage your brand or harm users.
Why Use ToxicityGuard¶
The risk: LLMs can be manipulated into generating hate speech, violent content, or other harmful material. Even without manipulation, they may occasionally produce inappropriate responses.
Real incidents:

- Chatbots generating racist content after user prompting
- AI assistants providing instructions for self-harm
- Customer service bots being tricked into issuing threats
ToxicityGuard catches these before they reach your users.
Categories¶
ToxicityGuard detects 7 categories of harmful content:
| Category | What It Catches | Example |
|---|---|---|
| Hate | Discrimination, slurs, identity attacks | "I hate [group]..." |
| Violence | Threats, graphic violence, glorification | "I'm going to hurt..." |
| Sexual | Explicit sexual content, solicitation | Adult content |
| SelfHarm | Suicide, self-injury promotion | "You should hurt yourself..." |
| Harassment | Bullying, targeted attacks, intimidation | "You're worthless..." |
| Dangerous | Dangerous activities, weapons, drugs | "How to make explosives..." |
| Illegal | Fraud, theft, other crimes | "How to steal a car..." |
How It Works¶
ToxicityGuard uses fast keyword-based detection with a configurable threshold:
- Input is analyzed for toxic patterns across all categories
- Each category receives a score from 0.0 to 1.0
- If any score exceeds your threshold, the input is blocked
- You get detailed category-level scores for logging/analysis
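As a rough illustration of that decision step (not the library's internal code), the helper below compares per-category scores against a threshold and reports the first category that exceeds it; the function and score map here are hypothetical:

```rust
use std::collections::HashMap;

// Illustrative only: the decision step described above, with hypothetical types.
// `scores` stands in for the per-category scores (0.0-1.0) computed by the guard.
fn first_violation(scores: &HashMap<String, f64>, threshold: f64) -> Option<(String, f64)> {
    for (category, score) in scores {
        if *score > threshold {
            return Some((category.clone(), *score));
        }
    }
    None
}

fn main() {
    let scores = HashMap::from([
        ("Hate".to_string(), 0.85),
        ("Violence".to_string(), 0.10),
    ]);
    // With the default threshold of 0.7, the Hate score triggers a block.
    if let Some((category, score)) = first_violation(&scores, 0.7) {
        println!("Blocked: Toxicity detected: {} ({:.2})", category, score);
    }
}
```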
Usage Examples¶
Basic Usage¶
Rust:

```rust
use oxide_guard::{Guard, ToxicityGuard};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7); // Block if any category score exceeds 0.7

let result = guard.check("I hate everyone in that group");
if !result.passed {
    println!("Blocked: {}", result.reason);
    // Output: "Blocked: Toxicity detected: Hate (0.85)"
}
```
Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
result = guard.check("I hate everyone in that group")
if not result.passed:
    print(f"Blocked: {result.reason}")
    # Output: "Blocked: Toxicity detected: Hate (0.85)"
```
Getting Category Scores¶
For logging, analytics, or custom handling:
Rust:

```rust
use oxide_guard::ToxicityGuard;

let guard = ToxicityGuard::new("toxicity");
let scores = guard.analyze("This is some user input");

for (category, score) in scores {
    println!("{}: {:.2}", category, score);
}
// Output:
// Hate: 0.12
// Violence: 0.05
// Sexual: 0.00
// SelfHarm: 0.00
// Harassment: 0.15
// Dangerous: 0.00
// Illegal: 0.00
```
Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold=0.7)
scores = guard.analyze("This is some user input")
for category, score in scores.items():
    print(f"{category}: {score:.2f}")
```
Filtering Specific Categories¶
Only check categories relevant to your use case:
Rust:

```rust
use oxide_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.7)
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Violence,
        ToxicityCategory::Harassment,
    ]);
// Only blocks Hate, Violence, and Harassment content;
// Sexual, SelfHarm, Dangerous, and Illegal are ignored.
```
Real Attack Examples¶
Hate Speech Blocked¶
Input: "Those [slur] people are all the same and should be..."
Result: BLOCKED
Category: Hate
Score: 0.92
Action: Block
Violence Detected¶
Input: "I'm going to find you and make you pay for what you did"
Result: BLOCKED
Category: Violence
Score: 0.78
Action: Block
Self-Harm Content Blocked¶
Input: "Life is meaningless, you should just end it all"
Result: BLOCKED
Category: SelfHarm
Score: 0.89
Action: Block
Clean Input Allowed¶
Input: "Can you help me write a professional email to my boss?"
Result: ALLOWED
Highest Score: Harassment (0.08)
Action: Allow
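These cases can be reproduced with a short loop that uses only the check API shown earlier; the inputs below are taken from this page, and the exact scores you see will depend on the detector:

```rust
use oxide_guard::{Guard, ToxicityGuard};

fn main() {
    let guard = ToxicityGuard::new("toxicity").with_threshold(0.7);

    // Sample inputs from the examples above: one expected to be blocked, one expected to pass.
    let samples = [
        "I'm going to find you and make you pay for what you did",
        "Can you help me write a professional email to my boss?",
    ];

    for input in samples {
        let result = guard.check(input);
        if result.passed {
            println!("ALLOWED: {}", input);
        } else {
            println!("BLOCKED: {} ({})", input, result.reason);
        }
    }
}
```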
Configuration Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.7 | Score threshold (0.0-1.0) to trigger blocking |
| categories | list | All | Categories to check |
| action | Action | Block | What to do when toxicity is detected |
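The threshold and categories options map to the builder methods shown in the examples above; a builder for action is not shown on this page, so the with_action line below is an assumption and is left commented out:

```rust
use oxide_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(0.5)       // stricter than the 0.7 default
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Harassment,
    ]);
// .with_action(Action::Block) // hypothetical: builder for the `action` option
```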
Threshold Guidelines¶
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.3 | Very strict, may have false positives | Children's apps, education |
| 0.5 | Strict, catches most offensive content | Professional environments |
| 0.7 | Balanced (default) | General consumer apps |
| 0.9 | Lenient, only catches obvious toxicity | Adult platforms |
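If the deployment profile is known at startup, the guideline values above can be wrapped in a small helper; threshold_for is a hypothetical function, not part of oxide_guard:

```rust
use oxide_guard::ToxicityGuard;

// Hypothetical helper: map a deployment profile to a threshold from the table above.
fn threshold_for(profile: &str) -> f64 {
    match profile {
        "children" | "education" => 0.3, // very strict, may have false positives
        "professional" => 0.5,           // strict, catches most offensive content
        "adult_platform" => 0.9,         // lenient, only obvious toxicity
        _ => 0.7,                        // balanced default for general consumer apps
    }
}

let guard = ToxicityGuard::new("toxicity")
    .with_threshold(threshold_for("professional"));
```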
Performance¶
| Metric | Value |
|---|---|
| Latency | <10ms |
| Memory | ~5MB |
| Throughput | 100,000+ checks/sec |
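These figures will vary with hardware and input length; a rough way to sanity-check them in your own environment, using only the check API from above:

```rust
use std::time::Instant;

use oxide_guard::{Guard, ToxicityGuard};

fn main() {
    let guard = ToxicityGuard::new("toxicity").with_threshold(0.7);
    let input = "Can you help me write a professional email to my boss?";

    let iterations = 100_000;
    let start = Instant::now();
    for _ in 0..iterations {
        let _ = guard.check(input); // result discarded; we only measure timing
    }
    let elapsed = start.elapsed();

    // Rough throughput for this single input on your hardware.
    println!("{} checks in {:?}", iterations, elapsed);
    println!("~{:.0} checks/sec", iterations as f64 / elapsed.as_secs_f64());
}
```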
When to Use¶
Use ToxicityGuard when:

- Your AI is customer-facing or public
- You need content moderation for compliance
- Brand safety is important
- You're deploying in regulated industries

Consider skipping when:

- Internal tools are used by trusted users only
- The LLM provider already does content filtering
- Your use case explicitly allows mature content
Integration with Other Guards¶
ToxicityGuard works well with:

- PatternGuard - Catch jailbreak attempts before the toxicity check
- PIIGuard - Redact personal data AND filter harmful content
- MLClassifierGuard - Layered defense with ML backup
```python
from oxideshield import multi_layer_defense

defense = multi_layer_defense(
    enable_length=True,      # Block jailbreaks first
    enable_toxicity=True,    # Then check for toxic content
    toxicity_threshold=0.7,
    strategy="fail_fast"
)
```
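In Rust, the same fail-fast layering can be written by hand; this sketch assumes the other guards implement the same Guard trait and check API as ToxicityGuard (their constructors are not shown on this page) and that the trait is object-safe:

```rust
use oxide_guard::Guard;

// Run guards in order and stop at the first one that blocks ("fail fast").
fn check_all(guards: &[Box<dyn Guard>], input: &str) -> bool {
    for guard in guards {
        let result = guard.check(input);
        if !result.passed {
            println!("Blocked: {}", result.reason);
            return false; // fail fast: later guards are not run
        }
    }
    true
}
```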
Limitations¶
- Context-dependent content: Discussions about violence (news, history) may trigger false positives
- Coded language: Rapidly evolving slang and dogwhistles may not be detected
- Sarcasm: Ironic statements may be misclassified
For highest accuracy, combine with MLClassifierGuard for context-aware detection.