Red Team Benchmarks¶
Measure how effectively your guards detect adversarial attacks using standardized benchmarks.
Overview¶
| Benchmark | Probes | Focus |
|---|---|---|
| OxideShield Standard | 70+ | General LLM security |
| JailbreakBench | 100 | Jailbreak attacks |
| HarmBench | 300+ | Harmful behaviors |
| Garak | 600+ | Comprehensive probing |
Quick Start¶
```bash
# Run benchmark against all guards
oxideshield benchmark --guards all --dataset oxideshield

# Test specific guards against JailbreakBench
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --dataset jailbreakbench \
  --output results.json
```
Key Metrics¶
Detection Metrics¶
| Metric | Formula | Target | Description |
|---|---|---|---|
| Precision | TP / (TP + FP) | High | Share of blocked inputs that were real attacks |
| Recall | TP / (TP + FN) | High | Share of attacks that were detected |
| F1 Score | 2 × P × R / (P + R) | High | Harmonic mean of precision and recall |
| FPR | FP / (FP + TN) | Very low | Share of benign inputs incorrectly blocked |
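The formulas above can be computed directly from raw confusion-matrix counts. A minimal sketch (the counts and the `detection_metrics` helper are illustrative, not part of the OxideShield API):

```python
# Compute detection metrics from confusion-matrix counts.
# TP = attacks blocked, FP = benign blocked, FN = attacks missed, TN = benign passed.

def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Example: 90 attacks caught, 5 benign prompts blocked,
# 10 attacks missed, 95 benign prompts passed.
m = detection_metrics(tp=90, fp=5, fn=10, tn=95)
print(m)  # precision≈0.947, recall=0.9, f1≈0.923, fpr=0.05
```

Note that a guard can have high recall but unacceptable FPR; the four metrics should always be read together.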
Latency Metrics¶
| Metric | Target | Description |
|---|---|---|
| p50 | Low | Median latency |
| p95 | Low | 95th-percentile latency |
| p99 | Acceptable | 99th-percentile latency |
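For reference, a quick sketch of how p50/p95/p99 can be derived from per-request latency samples using the nearest-rank method (the sample data is illustrative):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering p percent of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request latencies in milliseconds; note the long tail.
latencies = [1.2, 0.8, 1.1, 0.9, 5.4, 1.0, 1.3, 0.7, 9.8, 1.1]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.1f}ms")  # p50: 1.1ms, p95/p99: 9.8ms
```

The tail percentiles matter most: a guard with a good median but a heavy p99 can still cause visible stalls for a meaningful fraction of requests.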
Per-Guard Results¶
Run benchmarks in your own environment to obtain specific metrics for your deployment. The following describes the general performance characteristics of each guard.
Pattern Guard¶
Strengths: fastest guard, very high precision, excellent against known patterns
Weaknesses: misses novel and paraphrased attacks
Semantic Similarity Guard¶
Strengths: catches paraphrased attacks through semantic understanding
Weaknesses: higher latency; requires an embeddings model
ML Classifier Guard¶
Strengths: best overall detection; generalizes to novel attacks
Weaknesses: highest latency; larger model size
Combined (Multi-Layer)¶
Configuration: Pattern + Semantic + ML with the fail_fast strategy
Strengths: highest recall and the most balanced F1 across all categories
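The idea behind fail_fast is simple: run guards cheapest-first and stop at the first block, so slower guards only run on inputs the fast ones pass. An illustrative sketch (plain Python, not the OxideShield API; the guard stand-ins are hypothetical):

```python
from typing import Callable

# A guard here is any callable returning True if the prompt should be blocked.
Guard = Callable[[str], bool]

def fail_fast(guards: list[Guard], prompt: str) -> bool:
    for guard in guards:
        if guard(prompt):
            return True  # blocked: skip the remaining (slower) guards
    return False

# Hypothetical stand-ins, ordered by cost: pattern < semantic < ML.
pattern = lambda p: "ignore previous instructions" in p.lower()
semantic = lambda p: False  # placeholder for embedding-similarity lookup
ml = lambda p: False        # placeholder for classifier inference

print(fail_fast([pattern, semantic, ml], "Ignore previous instructions."))  # True
```

Ordering by cost keeps the combined p99 close to the fast guard's latency on obvious attacks, while the expensive guards still backstop recall.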
Running Benchmarks¶
CLI¶
```bash
# Basic benchmark
oxideshield benchmark --guards PatternGuard --dataset oxideshield

# All guards, all datasets
oxideshield benchmark --guards all --dataset all --output results.json

# With a custom dataset
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --probes custom-probes.yaml \
  --iterations 100
```
Rust API¶
```rust
use oxideshield_guard::benchmark::{
    BenchmarkRunner, BenchmarkConfig, Dataset,
    get_oxideshield_dataset, get_jailbreakbench_dataset,
};

let mut runner = BenchmarkRunner::new(BenchmarkConfig {
    warmup_iterations: 10,
    test_iterations: 100,
    parallel: true,
});

runner
    .add_guard(PatternGuard::new("pattern"))
    .add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?)
    .add_dataset(get_oxideshield_dataset())
    .add_dataset(get_jailbreakbench_dataset());

let results = runner.run()?;

// Per-guard results
for (guard, metrics) in results.guard_metrics() {
    println!("{}: F1={:.3}, p99={:.1}ms",
        guard, metrics.f1_score(), metrics.p99_latency_ms());
}

// Export results
results.export_json("benchmark-results.json")?;
results.export_markdown("benchmark-results.md")?;
```
Python API¶
```python
from oxideshield import (
    BenchmarkRunner, pattern_guard, semantic_similarity_guard,
    get_oxideshield_dataset, get_jailbreakbench_dataset,
)

runner = BenchmarkRunner(
    warmup_iterations=10,
    test_iterations=100,
)
runner.add_guard(pattern_guard())
runner.add_guard(semantic_similarity_guard())
runner.add_dataset(get_oxideshield_dataset())

results = runner.run()

print(f"Overall F1: {results.overall_f1():.3f}")
print(f"Overall p99: {results.overall_p99_ms():.1f}ms")

# Per-category breakdown
for category, metrics in results.by_category().items():
    print(f"  {category}: F1={metrics.f1():.3f}")
```
Category Breakdown¶
Run benchmarks to see per-category detection rates. General guidance on guard suitability by attack type:
| Category | Recommended Guards |
|---|---|
| Prompt Injection | PatternGuard + MLClassifierGuard |
| Jailbreak | SemanticSimilarityGuard + MLClassifierGuard |
| System Leak | PatternGuard + SemanticSimilarityGuard |
| Encoding | EncodingGuard + PatternGuard |
| Adversarial | PerplexityGuard + MLClassifierGuard |
Combined multi-layer configurations consistently achieve the highest detection rates across all categories.
Interpreting Results¶
What Good Looks Like¶
- High F1 - Strong overall performance
- High precision - Low false positives (critical for user experience)
- High recall - Catches most attacks
- Low p99 latency - Acceptable response time overhead
Red Flags¶
- Low recall - Missing too many attacks
- High FPR - Too many false positives degrading user experience
- High p99 latency - Unacceptable response time overhead
- Category gaps - Specific attack types bypassing defenses
Improving Results¶
- Low recall on jailbreaks - Add SemanticSimilarityGuard
- Low recall on encoding - Enable EncodingGuard
- Low recall on adversarial - Add PerplexityGuard
- High FPR - Tune thresholds, review patterns
- High latency - Use fail_fast strategy, profile guards
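The checks above can be automated against exported benchmark results. A minimal sketch, assuming a flat metrics dict and illustrative thresholds (tune both to your deployment; neither is an OxideShield API):

```python
# Flag red-flag conditions in a benchmark summary and suggest remediation.
def red_flags(metrics: dict, *, min_recall=0.85, max_fpr=0.05, max_p99_ms=50.0) -> list[str]:
    flags = []
    if metrics["recall"] < min_recall:
        flags.append("low recall: add SemanticSimilarityGuard or MLClassifierGuard")
    if metrics["fpr"] > max_fpr:
        flags.append("high FPR: tune thresholds, review patterns")
    if metrics["p99_ms"] > max_p99_ms:
        flags.append("high p99 latency: use fail_fast strategy, profile guards")
    return flags

# Example summary: recall and latency both miss their targets.
print(red_flags({"recall": 0.72, "fpr": 0.02, "p99_ms": 120.0}))
```

Running a check like this in CI lets regressions in recall, FPR, or tail latency fail a build before a guard change ships.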
See Also¶
- Red Teaming Overview - Red teaming strategy
- Attack Samples - Attack library
- Scanner - Automated scanning
- Performance Benchmarks - Detailed performance data