ToxicityGuard¶
Prevents your AI from generating or accepting harmful content across 7 toxicity categories. Essential for customer-facing applications where inappropriate content could damage your brand or harm users.
Why Use ToxicityGuard¶
The risk: LLMs can be manipulated into generating hate speech, violent content, or other harmful material. Even without manipulation, they may occasionally produce inappropriate responses.
Real incidents:

- Chatbots generating racist content after user prompting
- AI assistants providing instructions for self-harm
- Customer service bots being tricked into making threats
ToxicityGuard catches these before they reach your users.
Categories¶
ToxicityGuard detects 7 categories of harmful content:
| Category | What It Catches | Example |
|---|---|---|
| Hate | Discrimination, slurs, identity attacks | "I hate [group]..." |
| Violence | Threats, graphic violence, glorification | "I'm going to hurt..." |
| Sexual | Explicit sexual content, solicitation | Adult content |
| SelfHarm | Suicide, self-injury promotion | "You should hurt yourself..." |
| Harassment | Bullying, targeted attacks, intimidation | "You're worthless..." |
| Dangerous | Dangerous activities, weapons, drugs | "How to make explosives..." |
| Illegal | Fraud, theft, other crimes | "How to steal a car..." |
How It Works¶
ToxicityGuard uses fast, keyword-based detection with configurable thresholds:
- Input is analyzed for toxic patterns across all categories
- Each category receives a score from 0.0 to 1.0
- If any score exceeds your threshold, the input is blocked
- You get detailed category-level scores for logging/analysis
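The scoring flow above can be sketched with a toy keyword matcher. This is illustrative only: the keyword lists, the scoring rule, and the `0.5` threshold are assumptions for demonstration, not ToxicityGuard's actual detection logic.

```python
# Toy sketch of per-category scoring with a block threshold.
# Keyword lists and scores are made up for demonstration.
TOXIC_PATTERNS = {
    "Hate":       ["i hate", "slur"],
    "Violence":   ["hurt you", "kill"],
    "Harassment": ["worthless", "pathetic"],
}

def score_categories(text: str) -> dict:
    """Score each category 0.0-1.0 as the fraction of its patterns matched."""
    lowered = text.lower()
    return {
        category: sum(p in lowered for p in patterns) / len(patterns)
        for category, patterns in TOXIC_PATTERNS.items()
    }

def check(text: str, threshold: float = 0.5) -> tuple:
    """Block the input if any category score reaches the threshold."""
    scores = score_categories(text)
    passed = all(score < threshold for score in scores.values())
    return passed, scores

passed, scores = check("I hate everyone in that group")
# passed is False: the Hate score reaches the threshold
```

The real guard returns the same two pieces of information: a pass/fail decision and per-category scores for logging.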
Usage Examples¶
Basic Usage¶
Rust:

```rust
use oxideshield_guard::{Guard, ToxicityGuard};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold_preset("balanced"); // Use a balanced detection level

let result = guard.check("I hate everyone in that group");
if !result.passed {
    println!("Blocked: {}", result.reason);
    // Output: "Blocked: Toxicity detected: Hate (high)"
}
```
Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold_preset="balanced")
result = guard.check("I hate everyone in that group")
if not result.passed:
    print(f"Blocked: {result.reason}")
    # Output: "Blocked: Toxicity detected: Hate (high)"
```
Getting Category Scores¶
For logging, analytics, or custom handling:
Rust:

```rust
use oxideshield_guard::ToxicityGuard;

let guard = ToxicityGuard::new("toxicity");
let scores = guard.analyze("This is some user input");
for (category, level) in scores {
    println!("{}: {}", category, level);
}
// Output:
// Hate: low
// Violence: low
// Sexual: none
// SelfHarm: none
// Harassment: low
// Dangerous: none
// Illegal: none
```
Python:

```python
from oxideshield import toxicity_guard

guard = toxicity_guard(threshold_preset="balanced")
scores = guard.analyze("This is some user input")
for category, level in scores.items():
    print(f"{category}: {level}")
```
Filtering Specific Categories¶
Only check categories relevant to your use case:
Rust:

```rust
use oxideshield_guard::{ToxicityGuard, ToxicityCategory};

let guard = ToxicityGuard::new("toxicity")
    .with_threshold_preset("balanced")
    .with_categories(&[
        ToxicityCategory::Hate,
        ToxicityCategory::Violence,
        ToxicityCategory::Harassment,
    ]);
// Only blocks Hate, Violence, and Harassment content
// Sexual, SelfHarm, Dangerous, and Illegal are ignored
```
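The effect of category filtering can be sketched without the library: only the enabled categories' scores count toward the block decision, and everything else is ignored. The function below is a toy model under that assumption, not the guard's implementation.

```python
def check_filtered(scores: dict, enabled: set, threshold: float = 0.5) -> bool:
    """Return True (passed) if no *enabled* category reaches the threshold.
    Scores for disabled categories are ignored entirely."""
    return all(score < threshold
               for category, score in scores.items()
               if category in enabled)

scores = {"Hate": 0.1, "Violence": 0.2, "Sexual": 0.9}
check_filtered(scores, {"Hate", "Violence"})   # passes: Sexual is ignored
check_filtered(scores, {"Hate", "Sexual"})     # blocked by Sexual
```

Narrowing the category set this way trades coverage for fewer false positives in domains where some categories are irrelevant.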
Real Attack Examples¶
Hate Speech Blocked¶
```text
Input:    "Those [slur] people are all the same and should be..."
Result:   BLOCKED
Category: Hate
Level:    high
Action:   Block
```
Violence Detected¶
```text
Input:    "I'm going to find you and make you pay for what you did"
Result:   BLOCKED
Category: Violence
Level:    high
Action:   Block
```
Self-Harm Content Blocked¶
```text
Input:    "Life is meaningless, you should just end it all"
Result:   BLOCKED
Category: SelfHarm
Level:    high
Action:   Block
```
Clean Input Allowed¶
```text
Input:  "Can you help me write a professional email to my boss?"
Result: ALLOWED
Highest Category: Harassment (low)
Action: Allow
```
Configuration Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `threshold_preset` | string | `"balanced"` | Detection sensitivity preset (see guidelines below) |
| `categories` | list | All | Categories to check |
| `action` | Action | Block | What to do when toxicity is detected |
Threshold Guidelines¶
| Preset | Behavior | Use Case |
|---|---|---|
| `"strict"` | Very sensitive; may produce false positives | Children's apps, education |
| `"firm"` | Catches most offensive content | Professional environments |
| `"balanced"` | Balanced sensitivity (default) | General consumer apps |
| `"lenient"` | Only catches clearly toxic content | Adult platforms |
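One way to picture the presets is as numeric cutoffs on the 0.0-1.0 category scores. The values below are hypothetical (the library's real cutoffs are not documented here); they only illustrate why `"strict"` flags more content than `"lenient"`.

```python
# Hypothetical preset-to-threshold mapping; a lower cutoff = more sensitive.
PRESET_THRESHOLDS = {
    "strict":   0.3,   # flags even mildly suspicious content
    "firm":     0.5,
    "balanced": 0.7,   # default
    "lenient":  0.9,   # only clear-cut toxicity
}

def blocks(score: float, preset: str = "balanced") -> bool:
    """Would a category score trigger a block under this preset?"""
    return score >= PRESET_THRESHOLDS[preset]
```

Under this mapping, a category score of 0.6 would be blocked by `"strict"` and `"firm"` but allowed by `"balanced"` and `"lenient"`.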
Performance¶
| Metric | Value |
|---|---|
| Latency | Sub-millisecond per check |
| Memory | Minimal footprint |
| Throughput | Designed for high-volume production workloads |
When to Use¶
Use ToxicityGuard when:

- Your AI is customer-facing or public
- You need content moderation for compliance
- Brand safety is important
- You're deploying in regulated industries

Consider skipping when:

- Internal tools with trusted users only
- The LLM provider already performs content filtering
- Your use case explicitly allows mature content
Integration with Other Guards¶
ToxicityGuard works well with:

- PatternGuard - Catch jailbreak attempts before the toxicity check
- PIIGuard - Redact personal data AND filter harmful content
- MLClassifierGuard - Layered defense with ML backup
```python
from oxideshield import multi_layer_defense

defense = multi_layer_defense(
    enable_length=True,    # Block jailbreaks first
    enable_toxicity=True,  # Then check for toxic content
    toxicity_preset="balanced",
    strategy="fail_fast"
)
```
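The `fail_fast` strategy amounts to running guards in order and stopping at the first failure, so cheap checks can short-circuit expensive ones. A minimal self-contained sketch, where the guard functions and their `(passed, reason)` return shape are assumptions rather than the library's API:

```python
def fail_fast(guards, text):
    """Run guards in order; stop and report at the first failure."""
    for guard in guards:
        passed, reason = guard(text)
        if not passed:
            return False, reason   # later guards never run
    return True, "ok"

def length_guard(text):
    # Hypothetical stand-in for a cheap pre-filter.
    return len(text) < 1000, "input too long"

def toxicity_guard_stub(text):
    # Hypothetical stand-in for ToxicityGuard.
    return "hate" not in text.lower(), "toxicity detected"

fail_fast([length_guard, toxicity_guard_stub], "a friendly question")
# -> (True, "ok")
```

Ordering matters: putting the cheapest or most decisive guards first keeps average latency low on blocked traffic.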
Limitations¶
- Context-dependent content: Discussions about violence (news, history) may trigger false positives
- Coded language: Rapidly evolving slang and dogwhistles may not be detected
- Sarcasm: Ironic statements may be misclassified
For highest accuracy, combine with MLClassifierGuard for context-aware detection.