Red Teaming
OxideShield™ includes comprehensive red teaming capabilities for proactively testing your LLM application's security posture. Identify vulnerabilities before attackers do.
Why Red Team Your LLM?
| Without Red Teaming | With Red Teaming |
|---|---|
| Discover vulnerabilities in production | Find issues before deployment |
| React to security incidents | Prevent security incidents |
| Unknown attack surface | Mapped and tested attack surface |
| Hope guards work | Verified guard effectiveness |
Red Teaming Tools
1. Scanner CLI
Automated endpoint testing with 50+ attack probes:
oxideshield scan \
  --target https://api.example.com/v1/chat \
  --categories prompt_injection,jailbreak,system_leak
2. Attack Sample Library
Curated adversarial prompts from security research:
| Category | Samples | Description |
|---|---|---|
| Prompt Injection | 8 | OWASP LLM01 attacks |
| Jailbreaks | 6 | DAN, Grandma, Sudo, etc. |
| System Prompt Leaks | 8 | OWASP LLM06 extraction |
| AutoDAN | 3 | Genetic algorithm attacks |
| GCG | 3 | Gradient-based suffixes |
| Encoding | 5 | Base64, Unicode, leetspeak |
| Roleplay | 3 | Persona-based bypasses |
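To make the Encoding row concrete, here is a minimal Python sketch of how one harmless probe string could be turned into Base64, leetspeak, and Unicode variants for testing a guard. The function names and the substitution table are illustrative assumptions, not part of the OxideShield library or its sample set.

```python
import base64

# Illustrative leetspeak substitutions (an assumption, not OxideShield's table).
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def base64_variant(prompt: str) -> str:
    """Return the prompt Base64-encoded."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def leetspeak_variant(prompt: str) -> str:
    """Lowercase the prompt and swap common letters for look-alike digits."""
    return prompt.lower().translate(LEET_MAP)


def fullwidth_variant(prompt: str) -> str:
    """Map printable ASCII (U+0021..U+007E) to Unicode fullwidth forms."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if "!" <= ch <= "~" else ch for ch in prompt
    )


def encoding_variants(prompt: str) -> dict:
    """Collect all variants of one probe, keyed by technique name."""
    return {
        "base64": base64_variant(prompt),
        "leetspeak": leetspeak_variant(prompt),
        "fullwidth": fullwidth_variant(prompt),
    }
```

Running each variant through a guard alongside the plain prompt shows whether detection survives simple obfuscation.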
3. Benchmark Framework
Measure guard effectiveness against known attacks.
Results include:
- Precision, Recall, F1 Score
- Latency percentiles (p50, p99)
- False positive rate
- Category-specific breakdown
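For readers who want to see how these metrics relate, the following Python sketch derives them from labeled guard decisions using the standard definitions. It is not OxideShield's benchmark code. The `Sample` type, the field names, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    is_attack: bool    # ground-truth label for the prompt
    blocked: bool      # what the guard decided
    latency_ms: float  # time the guard took to decide


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Compute precision, recall, F1, false positive rate, and latency percentiles."""
    tp = sum(s.is_attack and s.blocked for s in samples)
    fp = sum((not s.is_attack) and s.blocked for s in samples)
    fn = sum(s.is_attack and (not s.blocked) for s in samples)
    tn = sum((not s.is_attack) and (not s.blocked) for s in samples)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0

    latencies = [s.latency_ms for s in samples]
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }
```

Recall shows how many attacks were caught, while the false positive rate shows how often benign traffic was blocked. Reading the two together avoids tuning a guard until it blocks everything.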
4. Threat Intelligence
Integrated threat feeds from security research:
| Source | Probes | Reference |
|---|---|---|
| JailbreakBench | 100 behaviors | arxiv:2404.01318 |
| HarmBench | 11 categories | arxiv:2402.04249 |
| Garak | 600+ probes | NVIDIA/garak |
Attack Categories
OWASP LLM Top 10
OxideShield™ includes probes for the following OWASP Top 10 for LLM Applications vulnerabilities (IDs follow the 2023 edition of the list):
| ID | Vulnerability | Probes |
|---|---|---|
| LLM01 | Prompt Injection | 15+ |
| LLM02 | Insecure Output Handling | 5+ |
| LLM06 | Sensitive Information Disclosure | 10+ |
| LLM07 | Insecure Plugin Design | 3+ |
| LLM09 | Overreliance | 5+ |
Research-Based Attacks
| Attack | Paper | Description |
|---|---|---|
| AutoDAN | arxiv:2310.04451 | Genetic algorithm adversarial prompts |
| GCG | arxiv:2307.15043 | Gradient-based universal attacks |
| PAIR | arxiv:2310.08419 | Prompt Automatic Iterative Refinement |
| TAP | arxiv:2312.02119 | Tree of Attacks with Pruning |
Red Teaming Workflow
 1. DISCOVER         2. TEST             3. FIX              4. VERIFY
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Run scanner │ ──▶ │   Analyze   │ ──▶ │  Configure  │ ──▶ │  Benchmark  │
│ on endpoint │     │  findings   │     │   guards    │     │   guards    │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
Step 1: Discover
# Scan your endpoint
oxideshield scan \
  --target "$YOUR_LLM_ENDPOINT" \
  --api-key "$API_KEY" \
  --format json \
  --output findings.json
Step 2: Test
Review findings by severity:
# Filter critical/high only
jq '.findings | map(select(.severity == "critical" or .severity == "high"))' findings.json
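For triage beyond a one-off filter, a short Python sketch can group the same report by severity and by category. It assumes only what the `jq` filter above implies: a top-level `findings` array whose items have a `severity` field. The `category` field is hypothetical and may be named differently in real scanner output.

```python
import json
from collections import Counter

URGENT = ("critical", "high")


def load_report(path: str) -> dict:
    """Read a JSON scan report from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def triage(report: dict) -> dict:
    """Summarize findings by severity and list the urgent ones."""
    findings = report.get("findings", [])
    by_severity = Counter(f.get("severity", "unknown") for f in findings)
    urgent = [f for f in findings if f.get("severity") in URGENT]
    # "category" is an assumed field name; missing values are bucketed together.
    urgent_by_category = Counter(f.get("category", "uncategorized") for f in urgent)
    return {
        "by_severity": dict(by_severity),
        "urgent": urgent,
        "urgent_by_category": dict(urgent_by_category),
    }
```

Calling `triage(load_report("findings.json"))` gives a quick view of which attack categories need guard changes first.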
Step 3: Fix
Configure guards to block detected attacks:
# guards.yaml
guards:
  - name: pattern
    type: pattern
    config:
      categories:
        - prompt_injection
        - jailbreak
        - system_prompt_leak
    action: block
  - name: perplexity
    type: perplexity
    config:
      threshold: 100  # Block adversarial suffixes
    action: block
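The perplexity threshold is easier to reason about with the standard definition in hand: perplexity is the exponential of the negative mean token log-probability. The Python sketch below shows that definition only. How OxideShield computes perplexity, which model it uses, and how its `threshold` value is scaled are not stated here, so treat the function names and the default of 100 as assumptions.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def exceeds_threshold(token_logprobs, threshold=100.0):
    """Return True when the text is less predictable than the threshold allows."""
    return perplexity(token_logprobs) > threshold


# Hand-written log-probabilities stand in for real language-model output.
fluent_text = [math.log(0.5)] * 4        # each token fairly predictable
random_suffix = [math.log(0.001)] * 4    # each token very unlikely

print(round(perplexity(fluent_text), 3))
print(round(perplexity(random_suffix), 3))
```

Optimized suffixes such as those produced by GCG tend to be strings of unlikely tokens, which is why a perplexity ceiling can flag them while leaving ordinary prose alone.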
Step 4: Verify
Benchmark the configured guards against the attack datasets to confirm that the fixes hold.
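One way to read verification results is per attack category, so that a strong overall score does not hide a weak spot. The Python sketch below computes block rates from `(category, blocked)` pairs and lists the categories that miss a target. The input shape and the 0.95 target are illustrative assumptions, not OxideShield output or a recommended value.

```python
from collections import defaultdict


def block_rate_by_category(results):
    """results: iterable of (category, blocked) pairs for attack samples only."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for category, was_blocked in results:
        totals[category] += 1
        blocked[category] += int(was_blocked)
    return {category: blocked[category] / totals[category] for category in totals}


def categories_below(rates, target=0.95):
    """Return the categories whose block rate is under the target, sorted by name."""
    return sorted(category for category, rate in rates.items() if rate < target)
```

Any category returned by `categories_below` is a candidate for another pass through Step 3.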
Continuous Red Teaming
Daily Automated Scans
# .github/workflows/security.yml
name: LLM Security
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 02:00 UTC
permissions:
  contents: read
  security-events: write  # Required to upload SARIF results
jobs:
  red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan LLM Endpoint
        run: |
          oxideshield scan \
            --target ${{ secrets.LLM_ENDPOINT }} \
            --api-key ${{ secrets.LLM_API_KEY }} \
            --min-severity high \
            --format sarif \
            --output results.sarif
      - name: Upload Results
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: results.sarif
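To inspect a SARIF file locally before or after upload, a small Python sketch can count results per level. It relies only on the SARIF 2.1.0 layout of `runs[].results[].level`. How OxideShield maps its own severities onto SARIF levels is not specified here, and the function names are illustrative.

```python
import json
from collections import Counter


def load_sarif(path: str) -> dict:
    """Read a SARIF log from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_sarif_levels(sarif: dict) -> Counter:
    """Count results per SARIF level across all runs in the log."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # A result without an explicit level is counted as "warning" here;
            # this sketch does not consult rule default configurations.
            counts[result.get("level", "warning")] += 1
    return counts
```

For example, `count_sarif_levels(load_sarif("results.sarif"))["error"]` gives the number of results reported at the `error` level.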
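Pre-Deployment Gate
#!/bin/bash
# pre-deploy.sh

# Run security scan
oxideshield scan --target "$STAGING_URL" --min-severity critical
status=$?

# Exit code 1 = critical findings
if [ "$status" -eq 1 ]; then
  echo "BLOCKED: Critical vulnerabilities found"
  exit 1
elif [ "$status" -ne 0 ]; then
  # Treat any other non-zero exit code as a failed scan rather than a pass
  echo "ERROR: Scan failed with exit code $status"
  exit "$status"
fi

echo "PASSED: No critical vulnerabilities"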
Next Steps
- Scanner CLI - Detailed scanner documentation
- Attack Samples - Browse attack library
- Benchmarks - Measure guard effectiveness
- Pattern Guard - Configure attack detection