
Red Team Benchmarks

Measure how effectively your guards detect adversarial attacks using standardized benchmarks.

Overview

| Benchmark | Probes | Focus |
|---|---|---|
| OxideShield Standard | 70+ | General LLM security |
| JailbreakBench | 100 | Jailbreak attacks |
| HarmBench | 300+ | Harmful behaviors |
| Garak | 600+ | Comprehensive probing |

Quick Start

# Run benchmark against all guards
oxideshield benchmark --guards all --dataset oxideshield

# Test specific guards against JailbreakBench
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --dataset jailbreakbench \
  --output results.json
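When exporting with `--output results.json`, you can post-process the file in a few lines. This is a sketch only: the schema assumed here (a top-level mapping of guard name to a metrics dict with an `"f1"` field) is a guess, not the documented export format, so verify it against an actual export first.

```python
import json

# Hypothetical post-processing of an exported results.json.
# Assumed schema: {"GuardName": {"f1": 0.93, ...}, ...} -- verify
# against your real export before relying on this shape.
def load_f1_scores(path):
    with open(path) as f:
        results = json.load(f)
    return {guard: metrics["f1"] for guard, metrics in results.items()}
```
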

Key Metrics

Detection Metrics

| Metric | Formula | Target | Description |
|---|---|---|---|
| Precision | TP / (TP + FP) | High | Accuracy when blocking |
| Recall | TP / (TP + FN) | High | Attack detection rate |
| F1 Score | 2 * P * R / (P + R) | High | Balanced measure |
| FPR | FP / (FP + TN) | Very low | False positive rate |
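The formulas above follow directly from a confusion matrix over benchmark probes (attacks as positives, benign traffic as negatives). A minimal sketch, with illustrative counts:

```python
# Detection metrics from raw confusion-matrix counts.
# tp: attacks blocked, fp: benign blocked, tn: benign allowed, fn: attacks missed.
def detection_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
# precision = 90/95 ~ 0.947, recall = 0.900, fpr = 0.050
```
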

Latency Metrics

| Metric | Target | Description |
|---|---|---|
| p50 | Low | Median latency |
| p95 | Low | 95th-percentile latency |
| p99 | Acceptable | 99th-percentile latency |
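p50/p95/p99 are percentiles over per-request latencies. A minimal nearest-rank sketch (real harnesses may interpolate, and with small samples the tail percentiles collapse to the maximum, as here):

```python
import math

# Nearest-rank percentile over per-request latencies (ms).
def percentile(samples, pct):
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.2, 2.5, 4.0, 12.0]
p50 = percentile(latencies, 50)  # 0.7
p99 = percentile(latencies, 99)  # 12.0 -- the single slow outlier dominates
```
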

Per-Guard Results

Run benchmarks in your own environment to obtain metrics specific to your deployment. The following describes the general performance characteristics of each guard.

Pattern Guard

Strengths: Fastest guard, very high precision, excellent on known patterns
Weaknesses: Misses novel and paraphrased attacks

Semantic Similarity Guard

Strengths: Catches paraphrased attacks through semantic understanding
Weaknesses: Higher latency; requires an embeddings model

ML Classifier Guard

Strengths: Best overall detection; generalizes to novel attacks
Weaknesses: Highest latency; largest model size

Combined (Multi-Layer)

Configuration: Pattern + Semantic + ML with the fail_fast strategy
Strengths: Highest recall and the most balanced F1 across all categories

Running Benchmarks

CLI

# Basic benchmark
oxideshield benchmark --guards PatternGuard --dataset oxideshield

# All guards, all datasets
oxideshield benchmark --guards all --dataset all --output results.json

# With custom dataset
oxideshield benchmark \
  --guards PatternGuard,SemanticSimilarityGuard \
  --probes custom-probes.yaml \
  --iterations 100

Rust API

use oxideshield_guard::benchmark::{
    BenchmarkRunner, BenchmarkConfig, Dataset,
    get_oxideshield_dataset, get_jailbreakbench_dataset
};

let runner = BenchmarkRunner::new(BenchmarkConfig {
    warmup_iterations: 10,
    test_iterations: 100,
    parallel: true,
});

runner
    .add_guard(PatternGuard::new("pattern"))
    .add_guard(SemanticSimilarityGuard::new("semantic", &embeddings)?)
    .add_dataset(get_oxideshield_dataset())
    .add_dataset(get_jailbreakbench_dataset());

let results = runner.run()?;

// Per-guard results
for (guard, metrics) in results.guard_metrics() {
    println!("{}: F1={:.3}, p99={:.1}ms",
        guard, metrics.f1_score(), metrics.p99_latency_ms());
}

// Export results
results.export_json("benchmark-results.json")?;
results.export_markdown("benchmark-results.md")?;

Python API

from oxideshield import (
    BenchmarkRunner, pattern_guard, semantic_similarity_guard,
    get_oxideshield_dataset, get_jailbreakbench_dataset
)

runner = BenchmarkRunner(
    warmup_iterations=10,
    test_iterations=100
)

runner.add_guard(pattern_guard())
runner.add_guard(semantic_similarity_guard())
runner.add_dataset(get_oxideshield_dataset())

results = runner.run()

print(f"Overall F1: {results.overall_f1():.3f}")
print(f"Overall p99: {results.overall_p99_ms():.1f}ms")

# Per-category breakdown
for category, metrics in results.by_category().items():
    print(f"  {category}: F1={metrics.f1():.3f}")

Category Breakdown

Run benchmarks to see per-category detection rates. As general guidance, the following guards are best suited to each attack type:

| Category | Recommended Guards |
|---|---|
| Prompt Injection | PatternGuard + MLClassifierGuard |
| Jailbreak | SemanticSimilarityGuard + MLClassifierGuard |
| System Leak | PatternGuard + SemanticSimilarityGuard |
| Encoding | EncodingGuard + PatternGuard |
| Adversarial | PerplexityGuard + MLClassifierGuard |

Combined multi-layer configurations consistently achieve the highest detection rates across all categories.
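The fail_fast idea behind such multi-layer setups can be sketched generically: run guards cheapest-first and short-circuit on the first block, so the expensive layers only see traffic the cheap ones pass. The guard callables below are illustrative stand-ins, not the OxideShield API:

```python
# fail_fast multi-layer evaluation: guards ordered cheapest -> most
# expensive; the chain stops at the first guard that blocks.
def multi_layer_check(text, guards):
    for name, guard in guards:
        if guard(text):
            return ("blocked", name)
    return ("allowed", None)

# Hypothetical stand-in guards for illustration only.
guards = [
    ("pattern", lambda t: "ignore previous instructions" in t.lower()),
    ("length", lambda t: len(t) > 10_000),  # crude stand-in for a costlier layer
]
```

With this ordering, a prompt caught by the cheap pattern check never reaches the later, slower layers, which is why fail_fast keeps p99 latency manageable even with several guards stacked.
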

Interpreting Results

What Good Looks Like

  • High F1 - Strong overall performance
  • High precision - Low false positives (critical for user experience)
  • High recall - Catches most attacks
  • Low p99 latency - Acceptable response time overhead

Red Flags

  • Low recall - Missing too many attacks
  • High FPR - Too many false positives degrading user experience
  • High p99 latency - Unacceptable response time overhead
  • Category gaps - Specific attack types bypassing defenses

Improving Results

  1. Low recall on jailbreaks - Add SemanticSimilarityGuard
  2. Low recall on encoding - Enable EncodingGuard
  3. Low recall on adversarial - Add PerplexityGuard
  4. High FPR - Tune thresholds, review patterns
  5. High latency - Use fail_fast strategy, profile guards
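For item 4, threshold tuning amounts to sweeping a guard's score cutoff against held-out benign traffic and picking the lowest cutoff whose FPR stays within budget, then reading off the recall you pay for it. A generic sketch with illustrative scores (not tied to any OxideShield API):

```python
# Pick the lowest score threshold whose FPR on benign samples stays
# under max_fpr; returns (threshold, recall, fpr) or None if no
# threshold meets the budget.
def tune_threshold(attack_scores, benign_scores, max_fpr):
    for t in sorted(set(attack_scores + benign_scores)):
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        if fpr <= max_fpr:
            recall = sum(s >= t for s in attack_scores) / len(attack_scores)
            return (t, recall, fpr)
    return None

result = tune_threshold(
    attack_scores=[0.8, 0.9, 0.95, 0.7],
    benign_scores=[0.1, 0.2, 0.3, 0.75],
    max_fpr=0.25,
)
# -> (0.7, 1.0, 0.25): threshold 0.7 keeps all attacks while one benign
#    sample (0.75) is still flagged.
```
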

See Also