
Red Team Benchmarks

Measure how effectively your guards detect adversarial attacks using standardized benchmarks.

Overview

| Benchmark | Probes | Focus |
|---|---|---|
| OxideShield Standard | 70+ | General LLM security |
| JailbreakBench | 100 | Jailbreak attacks |
| HarmBench | 300+ | Harmful behaviors |
| Garak | 600+ | Comprehensive probing |

Quick Start

# Run benchmark against all guards
oxideshield benchmark --guards all --dataset oxideshield

# Test specific guards against JailbreakBench
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --dataset jailbreakbench \
  --output results.json
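When exporting with `--output results.json`, you can post-process the file in a few lines. This is a sketch only: the schema assumed here (a top-level mapping of guard name to a metrics dict with an `"f1"` field) is a guess, not the documented export format, so verify it against an actual export first.

```python
import json

# Hypothetical post-processing of an exported results.json.
# Assumed schema: {"GuardName": {"f1": 0.93, ...}, ...} -- verify
# against your real export before relying on this shape.
def load_f1_scores(path):
    with open(path) as f:
        results = json.load(f)
    return {guard: metrics["f1"] for guard, metrics in results.items()}
```
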

Key Metrics

Detection Metrics

| Metric | Formula | Target | Description |
|---|---|---|---|
| Precision | TP / (TP + FP) | High | Accuracy when blocking |
| Recall | TP / (TP + FN) | High | Attack detection rate |
| F1 Score | 2 * P * R / (P + R) | High | Balanced measure |
| FPR | FP / (FP + TN) | Very low | False positive rate |
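The formulas above follow directly from a confusion matrix over benchmark probes (attacks as positives, benign traffic as negatives). A minimal sketch, with illustrative counts:

```python
# Detection metrics from raw confusion-matrix counts.
# tp: attacks blocked, fp: benign blocked, tn: benign allowed, fn: attacks missed.
def detection_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
# precision = 90/95 ~ 0.947, recall = 0.900, fpr = 0.050
```
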

Latency Metrics

| Metric | Target | Description |
|---|---|---|
| p50 | Low | Median latency |
| p95 | Low | 95th-percentile latency |
| p99 | Acceptable | 99th-percentile latency |
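p50/p95/p99 are percentiles over per-request latencies. A minimal nearest-rank sketch (real harnesses may interpolate, and with small samples the tail percentiles collapse to the maximum, as here):

```python
import math

# Nearest-rank percentile over per-request latencies (ms).
def percentile(samples, pct):
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.2, 2.5, 4.0, 12.0]
p50 = percentile(latencies, 50)  # 0.7
p99 = percentile(latencies, 99)  # 12.0 -- the single slow outlier dominates
```
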

Per-Guard Results

Run benchmarks in your own environment to obtain metrics specific to your deployment. The following describes the general performance characteristics of each guard.

Pattern Guard

Strengths: Fastest guard, very high precision, excellent on known patterns
Weaknesses: Misses novel and paraphrased attacks

Semantic Similarity Guard

Strengths: Catches paraphrased attacks through semantic understanding
Weaknesses: Higher latency; requires an embeddings model

ML Classifier Guard

Strengths: Best overall detection; generalizes to novel attacks
Weaknesses: Highest latency; largest model size

Combined (Multi-Layer)

Configuration: Pattern + Semantic + ML with the fail_fast strategy
Strengths: Highest recall and the most balanced F1 across all categories

Running Benchmarks

CLI

# Basic benchmark
oxideshield benchmark --guards PatternGuard --dataset oxideshield

# All guards, all datasets
oxideshield benchmark --guards all --dataset all --output results.json

# With custom dataset
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --probes custom-probes.yaml \
  --iterations 100

Rust API

use oxideshield_guard::benchmark::{
    BenchmarkRunner, BenchmarkConfig, Dataset,
    get_oxideshield_dataset, get_jailbreakbench_dataset
};

let runner = BenchmarkRunner::new(BenchmarkConfig {
    warmup_iterations: 10,
    test_iterations: 100,
    parallel: true,
});

runner
    .add_guard(PatternGuard::new("pattern"))
    .add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?)
    .add_dataset(get_oxideshield_dataset())
    .add_dataset(get_jailbreakbench_dataset());

let results = runner.run()?;

// Per-guard results
for (guard, metrics) in results.guard_metrics() {
    println!("{}: F1={:.3}, p99={:.1}ms",
        guard, metrics.f1_score(), metrics.p99_latency_ms());
}

// Export results
results.export_json("benchmark-results.json")?;
results.export_markdown("benchmark-results.md")?;

Python API

from oxideshield import (
    BenchmarkRunner, pattern_guard, semantic_similarity_guard,
    get_oxideshield_dataset, get_jailbreakbench_dataset
)

runner = BenchmarkRunner(
    warmup_iterations=10,
    test_iterations=100
)

runner.add_guard(pattern_guard())
runner.add_guard(semantic_similarity_guard())
runner.add_dataset(get_oxideshield_dataset())

results = runner.run()

print(f"Overall F1: {results.overall_f1():.3f}")
print(f"Overall p99: {results.overall_p99_ms():.1f}ms")

# Per-category breakdown
for category, metrics in results.by_category().items():
    print(f"  {category}: F1={metrics.f1():.3f}")

Category Breakdown

Run benchmarks to see per-category detection rates. As general guidance, the following guards are best suited to each attack type:

| Category | Recommended Guards |
|---|---|
| Prompt Injection | PatternGuard + MLClassifierGuard |
| Jailbreak | SemanticSimilarityGuard + MLClassifierGuard |
| System Leak | PatternGuard + SemanticSimilarityGuard |
| Encoding | EncodingGuard + PatternGuard |
| Adversarial | PerplexityGuard + MLClassifierGuard |

Combined multi-layer configurations consistently achieve the highest detection rates across all categories.
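The fail_fast idea behind such multi-layer setups can be sketched generically: run guards cheapest-first and short-circuit on the first block, so the expensive layers only see traffic the cheap ones pass. The guard callables below are illustrative stand-ins, not the OxideShield API:

```python
# fail_fast multi-layer evaluation: guards ordered cheapest -> most
# expensive; the chain stops at the first guard that blocks.
def multi_layer_check(text, guards):
    for name, guard in guards:
        if guard(text):
            return ("blocked", name)
    return ("allowed", None)

# Hypothetical stand-in guards for illustration only.
guards = [
    ("pattern", lambda t: "ignore previous instructions" in t.lower()),
    ("length", lambda t: len(t) > 10_000),  # crude stand-in for a costlier layer
]
```

With this ordering, a prompt caught by the cheap pattern check never reaches the later, slower layers, which is why fail_fast keeps p99 latency manageable even with several guards stacked.
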

Interpreting Results

What Good Looks Like

  • High F1 - Strong overall performance
  • High precision - Low false positives (critical for user experience)
  • High recall - Catches most attacks
  • Low p99 latency - Acceptable response time overhead

Red Flags

  • Low recall - Missing too many attacks
  • High FPR - Too many false positives degrading user experience
  • High p99 latency - Unacceptable response time overhead
  • Category gaps - Specific attack types bypassing defenses

Improving Results

  1. Low recall on jailbreaks - Add SemanticSimilarityGuard
  2. Low recall on encoding - Enable EncodingGuard
  3. Low recall on adversarial - Add PerplexityGuard
  4. High FPR - Tune thresholds, review patterns
  5. High latency - Use fail_fast strategy, profile guards
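For item 4, threshold tuning amounts to sweeping a guard's score cutoff against held-out benign traffic and picking the lowest cutoff whose FPR stays within budget, then reading off the recall you pay for it. A generic sketch with illustrative scores (not tied to any OxideShield API):

```python
# Pick the lowest score threshold whose FPR on benign samples stays
# under max_fpr; returns (threshold, recall, fpr) or None if no
# threshold meets the budget.
def tune_threshold(attack_scores, benign_scores, max_fpr):
    for t in sorted(set(attack_scores + benign_scores)):
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        if fpr <= max_fpr:
            recall = sum(s >= t for s in attack_scores) / len(attack_scores)
            return (t, recall, fpr)
    return None

result = tune_threshold(
    attack_scores=[0.8, 0.9, 0.95, 0.7],
    benign_scores=[0.1, 0.2, 0.3, 0.75],
    max_fpr=0.25,
)
# -> (0.7, 1.0, 0.25): threshold 0.7 keeps all attacks while one benign
#    sample (0.75) is still flagged.
```
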

See Also