Red Team Benchmarks

Measure how effectively your guards detect adversarial attacks using standardized benchmarks.

Overview

| Benchmark | Probes | Focus |
|---|---|---|
| OxideShield Standard | 70+ | General LLM security |
| JailbreakBench | 100 | Jailbreak attacks |
| HarmBench | 300+ | Harmful behaviors |
| Garak | 600+ | Comprehensive probing |

Quick Start

# Run benchmark against all guards
oxideshield benchmark --guards all --dataset oxideshield

# Test specific guards against JailbreakBench
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --dataset jailbreakbench \
  --output results.json

Key Metrics

Detection Metrics

| Metric | Formula | Good Value | Description |
|---|---|---|---|
| Precision | TP / (TP + FP) | >95% | Share of blocked requests that were real attacks |
| Recall | TP / (TP + FN) | >90% | Share of attacks that were blocked |
| F1 Score | 2 * P * R / (P + R) | >93% | Harmonic mean of precision and recall |
| FPR | FP / (FP + TN) | <5% | Share of benign requests wrongly blocked |
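
For concreteness, here is a minimal sketch of how these four metrics fall out of raw confusion counts. The `ConfusionCounts` struct and its field names are illustrative only, not part of the oxide_guard API:

struct ConfusionCounts {
    tp: f64,  // attacks correctly blocked
    fp: f64,  // benign prompts wrongly blocked
    fn_: f64, // attacks missed (`fn` is a Rust keyword)
    tn: f64,  // benign prompts correctly allowed
}

impl ConfusionCounts {
    fn precision(&self) -> f64 { self.tp / (self.tp + self.fp) }
    fn recall(&self) -> f64 { self.tp / (self.tp + self.fn_) }
    fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        2.0 * p * r / (p + r)
    }
    fn fpr(&self) -> f64 { self.fp / (self.fp + self.tn) }
}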

Latency Metrics

| Metric | Target | Description |
|---|---|---|
| p50 | <30ms | Median latency |
| p95 | <50ms | 95th percentile latency |
| p99 | <100ms | 99th percentile latency |
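
Percentiles here follow the nearest-rank method over recorded per-request latencies. This helper is a sketch of that calculation (assuming a non-empty sample set), not the library's internal implementation:

// Nearest-rank percentile: sort, then take the ceil(pct/100 * n)-th sample.
fn percentile_ms(samples: &mut [f64], pct: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((pct / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.max(1) - 1]
}

// p50, p95, and p99 are then percentile_ms(&mut latencies, 50.0), 95.0, 99.0.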

Per-Guard Results

Pattern Guard

| Dataset | Precision | Recall | F1 | p50 |
|---|---|---|---|---|
| OxideShield | 0.98 | 0.85 | 0.91 | 0.1ms |
| JailbreakBench | 0.95 | 0.78 | 0.86 | 0.1ms |

Strengths: Fast, with high precision on known attack patterns
Weaknesses: Misses novel and paraphrased attacks

Semantic Similarity Guard

| Dataset | Precision | Recall | F1 | p50 |
|---|---|---|---|---|
| OxideShield | 0.94 | 0.92 | 0.93 | 15ms |
| JailbreakBench | 0.92 | 0.89 | 0.90 | 18ms |

Strengths: Catches paraphrased attacks through semantic understanding
Weaknesses: Higher latency; requires an embeddings model

ML Classifier Guard

| Dataset | Precision | Recall | F1 | p50 |
|---|---|---|---|---|
| OxideShield | 0.96 | 0.94 | 0.95 | 20ms |
| JailbreakBench | 0.94 | 0.92 | 0.93 | 22ms |

Strengths: Best overall detection; generalizes to novel attacks
Weaknesses: Highest latency and a larger model footprint

Combined (Multi-Layer)

| Dataset | Precision | Recall | F1 | p50 |
|---|---|---|---|---|
| OxideShield | 0.96 | 0.94 | 0.95 | 25ms |
| JailbreakBench | 0.95 | 0.93 | 0.94 | 28ms |

Configuration: Pattern + Semantic + ML with fail_fast strategy
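
A sketch of what that configuration could look like. The `GuardPipeline` builder, `Strategy::FailFast`, and `MlClassifierGuard` names are assumptions for illustration, not confirmed oxide_guard API; the point is the ordering, with the cheapest guard first:

// Hypothetical composition API -- type and variant names are assumptions.
let pipeline = GuardPipeline::builder()
    .add(PatternGuard::new("pattern"))                           // ~0.1ms: known patterns first
    .add(SemanticSimilarityGuard::new("semantic", &embeddings)?) // ~15ms: paraphrase coverage
    .add(MlClassifierGuard::new("ml", &classifier)?)             // ~20ms: novel attacks
    .strategy(Strategy::FailFast) // stop at the first guard that blocks
    .build();

With fail_fast, a request blocked by an early layer never pays the later guards' latency.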

Running Benchmarks

CLI

# Basic benchmark
oxideshield benchmark --guards PatternGuard --dataset oxideshield

# All guards, all datasets
oxideshield benchmark --guards all --dataset all --output results.json

# With custom dataset
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --probes custom-probes.yaml \
  --iterations 100

Rust API

use oxide_guard::benchmark::{
    BenchmarkRunner, BenchmarkConfig, Dataset,
    get_oxideshield_dataset, get_jailbreakbench_dataset
};

let mut runner = BenchmarkRunner::new(BenchmarkConfig {
    warmup_iterations: 10,
    test_iterations: 100,
    parallel: true,
});

runner
    .add_guard(PatternGuard::new("pattern"))
    .add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?) // `embeddings` model built beforehand
    .add_dataset(get_oxideshield_dataset())
    .add_dataset(get_jailbreakbench_dataset());

let results = runner.run()?;

// Per-guard results
for (guard, metrics) in results.guard_metrics() {
    println!("{}: F1={:.3}, p99={:.1}ms",
        guard, metrics.f1_score(), metrics.p99_latency_ms());
}

// Export results
results.export_json("benchmark-results.json")?;
results.export_markdown("benchmark-results.md")?;

Python API

from oxideshield import (
    BenchmarkRunner, pattern_guard, semantic_similarity_guard,
    get_oxideshield_dataset, get_jailbreakbench_dataset
)

runner = BenchmarkRunner(
    warmup_iterations=10,
    test_iterations=100
)

runner.add_guard(pattern_guard())
runner.add_guard(semantic_similarity_guard())
runner.add_dataset(get_oxideshield_dataset())

results = runner.run()

print(f"Overall F1: {results.overall_f1():.3f}")
print(f"Overall p99: {results.overall_p99_ms():.1f}ms")

# Per-category breakdown
for category, metrics in results.by_category().items():
    print(f"  {category}: F1={metrics.f1():.3f}")

Category Breakdown

By Attack Type

| Category | Pattern | Semantic | ML | Combined |
|---|---|---|---|---|
| Prompt Injection | 0.92 | 0.90 | 0.94 | 0.96 |
| Jailbreak | 0.78 | 0.89 | 0.93 | 0.95 |
| System Leak | 0.95 | 0.88 | 0.91 | 0.96 |
| Encoding | 0.98 | 0.75 | 0.85 | 0.98 |
| Adversarial | 0.45 | 0.82 | 0.88 | 0.90 |

By Severity

| Severity | Detection Rate | FPR |
|---|---|---|
| Critical | 98% | 1% |
| High | 95% | 2% |
| Medium | 90% | 4% |
| Low | 85% | 5% |

Interpreting Results

What Good Looks Like

  • F1 > 0.93 - Strong overall performance
  • Precision > 0.95 - Low false positives (critical for UX)
  • Recall > 0.90 - Catches most attacks
  • p99 < 50ms - Acceptable latency (a CI gate for these targets is sketched below)
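
These targets can double as a CI gate, reusing the Rust API shown above (`f1_score` and `p99_latency_ms` come from the benchmark results; the loop itself is a sketch):

// Fail the build if any guard misses the targets above.
for (guard, metrics) in results.guard_metrics() {
    assert!(metrics.f1_score() > 0.93, "{}: F1 below target", guard);
    assert!(metrics.p99_latency_ms() < 50.0, "{}: p99 above target", guard);
}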

Red Flags

  • Recall < 0.80 - Missing too many attacks
  • FPR > 10% - Too many false positives
  • p99 > 200ms - Unacceptable latency
  • Category gaps - Specific attack types bypassing detection

Improving Results

  1. Low recall on jailbreaks - Add SemanticSimilarityGuard
  2. Low recall on encoding - Enable EncodingGuard
  3. Low recall on adversarial - Add PerplexityGuard (see the sketch after this list)
  4. High FPR - Tune thresholds, review patterns
  5. High latency - Use fail_fast strategy, profile guards
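
For fixes 1-3 this amounts to registering the extra guards before re-running the benchmark. The `EncodingGuard` and `PerplexityGuard` constructor signatures below are assumptions patterned on `PatternGuard::new`:

// Constructor signatures for the added guards are assumed, not confirmed.
runner
    .add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?) // jailbreak recall
    .add_guard(EncodingGuard::new("encoding"))       // base64/hex-obfuscated payloads
    .add_guard(PerplexityGuard::new("perplexity"));  // adversarial suffixes
let results = runner.run()?;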

Competitor Comparison

| Tool | F1 | Precision | p50 | Notes |
|---|---|---|---|---|
| OxideShield | 0.94 | 0.96 | 15ms | Multi-layer defense |
| Llama Guard 3 | 0.94 | 0.96 | 100ms | Requires GPU |
| LLM Guard | 0.90 | 0.92 | 50ms | Python-based |
| Lakera Guard | 0.89 | 0.91 | 66ms | Cloud API |
| NeMo Guardrails | 0.85 | 0.88 | 200ms | LLM-based |

See Also