Red Team Benchmarks
Measure how effectively your guards detect adversarial attacks using standardized benchmarks.
Overview
| Benchmark |
Probes |
Focus |
| OxideShield Standard |
70+ |
General LLM security |
| JailbreakBench |
100 |
Jailbreak attacks |
| HarmBench |
300+ |
Harmful behaviors |
| Garak |
600+ |
Comprehensive probing |
Quick Start
# Run benchmark against all guards
oxideshield benchmark --guards all --dataset oxideshield
# Test specific guards against JailbreakBench
oxideshield benchmark \
--guards PatternGuard,SemanticSimilarityGuard \
--dataset jailbreakbench \
--output results.json
Key Metrics
Detection Metrics
| Metric |
Formula |
Good Value |
Description |
| Precision |
TP / (TP + FP) |
>95% |
Accuracy when blocking |
| Recall |
TP / (TP + FN) |
>90% |
Attack detection rate |
| F1 Score |
2 * P * R / (P + R) |
>93% |
Balanced measure |
| FPR |
FP / (FP + TN) |
<5% |
False positive rate |
Latency Metrics
| Metric |
Target |
Description |
| p50 |
<30ms |
Median latency |
| p95 |
<50ms |
95th percentile |
| p99 |
<100ms |
99th percentile |
Per-Guard Results
Pattern Guard
| Dataset |
Precision |
Recall |
F1 |
p50 |
| OxideShield |
0.98 |
0.85 |
0.91 |
0.1ms |
| JailbreakBench |
0.95 |
0.78 |
0.86 |
0.1ms |
Strengths: Fast, high precision, known patterns
Weaknesses: Misses novel attacks, paraphrased attacks
Semantic Similarity Guard
| Dataset |
Precision |
Recall |
F1 |
p50 |
| OxideShield |
0.94 |
0.92 |
0.93 |
15ms |
| JailbreakBench |
0.92 |
0.89 |
0.90 |
18ms |
Strengths: Catches paraphrased attacks, semantic understanding
Weaknesses: Higher latency, requires embeddings model
ML Classifier Guard
| Dataset |
Precision |
Recall |
F1 |
p50 |
| OxideShield |
0.96 |
0.94 |
0.95 |
20ms |
| JailbreakBench |
0.94 |
0.92 |
0.93 |
22ms |
Strengths: Best overall detection, generalizes to novel attacks
Weaknesses: Highest latency, model size
Combined (Multi-Layer)
| Dataset |
Precision |
Recall |
F1 |
p50 |
| OxideShield |
0.96 |
0.94 |
0.95 |
25ms |
| JailbreakBench |
0.95 |
0.93 |
0.94 |
28ms |
Configuration: Pattern + Semantic + ML with fail_fast strategy
Running Benchmarks
CLI
# Basic benchmark
oxideshield benchmark --guards PatternGuard --dataset oxideshield
# All guards, all datasets
oxideshield benchmark --guards all --dataset all --output results.json
# With custom dataset
oxideshield benchmark \
--guards PatternGuard,SemanticSimilarityGuard \
--probes custom-probes.yaml \
--iterations 100
Rust API
use oxide_guard::benchmark::{
BenchmarkRunner, BenchmarkConfig, Dataset,
get_oxideshield_dataset, get_jailbreakbench_dataset
};
let runner = BenchmarkRunner::new(BenchmarkConfig {
warmup_iterations: 10,
test_iterations: 100,
parallel: true,
});
runner
.add_guard(PatternGuard::new("pattern"))
.add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?)
.add_dataset(get_oxideshield_dataset())
.add_dataset(get_jailbreakbench_dataset());
let results = runner.run()?;
// Per-guard results
for (guard, metrics) in results.guard_metrics() {
println!("{}: F1={:.3}, p99={:.1}ms",
guard, metrics.f1_score(), metrics.p99_latency_ms());
}
// Export results
results.export_json("benchmark-results.json")?;
results.export_markdown("benchmark-results.md")?;
Python API
from oxideshield import (
BenchmarkRunner, pattern_guard, semantic_similarity_guard,
get_oxideshield_dataset, get_jailbreakbench_dataset
)
runner = BenchmarkRunner(
warmup_iterations=10,
test_iterations=100
)
runner.add_guard(pattern_guard())
runner.add_guard(semantic_similarity_guard())
runner.add_dataset(get_oxideshield_dataset())
results = runner.run()
print(f"Overall F1: {results.overall_f1():.3f}")
print(f"Overall p99: {results.overall_p99_ms():.1f}ms")
# Per-category breakdown
for category, metrics in results.by_category().items():
print(f" {category}: F1={metrics.f1():.3f}")
Category Breakdown
By Attack Type
| Category |
Pattern |
Semantic |
ML |
Combined |
| Prompt Injection |
0.92 |
0.90 |
0.94 |
0.96 |
| Jailbreak |
0.78 |
0.89 |
0.93 |
0.95 |
| System Leak |
0.95 |
0.88 |
0.91 |
0.96 |
| Encoding |
0.98 |
0.75 |
0.85 |
0.98 |
| Adversarial |
0.45 |
0.82 |
0.88 |
0.90 |
By Severity
| Severity |
Detection Rate |
FPR |
| Critical |
98% |
1% |
| High |
95% |
2% |
| Medium |
90% |
4% |
| Low |
85% |
5% |
Interpreting Results
What Good Looks Like
- F1 > 0.93 - Strong overall performance
- Precision > 0.95 - Low false positives (critical for UX)
- Recall > 0.90 - Catches most attacks
- p99 < 50ms - Acceptable latency
Red Flags
- Recall < 0.80 - Missing too many attacks
- FPR > 10% - Too many false positives
- p99 > 200ms - Unacceptable latency
- Category gaps - Specific attack types bypassing
Improving Results
- Low recall on jailbreaks - Add SemanticSimilarityGuard
- Low recall on encoding - Enable EncodingGuard
- Low recall on adversarial - Add PerplexityGuard
- High FPR - Tune thresholds, review patterns
- High latency - Use fail_fast strategy, profile guards
Competitor Comparison
| Tool |
F1 |
Precision |
p50 |
Notes |
| OxideShield |
0.94 |
0.96 |
15ms |
Multi-layer defense |
| Llama Guard 3 |
0.94 |
0.96 |
100ms |
Requires GPU |
| LLM Guard |
0.90 |
0.92 |
50ms |
Python-based |
| Lakera Guard |
0.89 |
0.91 |
66ms |
Cloud API |
| NeMo Guardrails |
0.85 |
0.88 |
200ms |
LLM-based |
See Also