Attack Sample Library¶
OxideShield includes a curated library of adversarial prompts from security research for testing your defenses.
Overview¶
| Category | Samples | Source |
|---|---|---|
| Prompt Injection | 8 | OWASP LLM01 |
| Jailbreaks | 6 | Security research |
| System Prompt Leaks | 8 | OWASP LLM06 |
| AutoDAN | 3 | arxiv:2310.04451 |
| GCG | 3 | arxiv:2307.15043 |
| Encoding Attacks | 5 | Security research |
| Roleplay Bypasses | 3 | Red team community |
Prompt Injection Samples¶
Instruction Override¶
Tests whether the model can be instructed to discard its original system prompt and follow attacker-supplied instructions instead.
[Attack payload available with Professional license]
Category: prompt_injection
Severity: Critical
Target Guard: PatternGuard, SemanticSimilarityGuard
Context Continuation¶
Tests whether injecting a fake end-of-conversation marker followed by a fraudulent administrator note can trick the model into a new context.
[Attack payload available with Professional license]
Category: prompt_injection
Severity: High
Target Guard: PatternGuard
Developer Mode¶
Tests whether claiming a fictional "developer mode" can cause the model to bypass safety filters.
[Attack payload available with Professional license]
Category: prompt_injection
Severity: High
Target Guard: PatternGuard
Jailbreak Samples¶
DAN (Do Anything Now)¶
Tests the classic "Do Anything Now" persona-based jailbreak, which attempts to create an unrestricted alter-ego for the model.
[Attack payload available with Professional license]
Category: jailbreak
Severity: Critical
Target Guard: PatternGuard, MLClassifierGuard
Grandma Exploit¶
Tests an emotional manipulation technique that uses a fictional deceased relative narrative to elicit harmful content.
[Attack payload available with Professional license]
Category: jailbreak
Severity: High
Target Guard: SemanticSimilarityGuard, MLClassifierGuard
Sudo Mode¶
Tests whether claiming elevated system privileges (analogous to Unix sudo) can trick the model into lifting restrictions.
[Attack payload available with Professional license]
Category: jailbreak
Severity: Medium
Target Guard: PatternGuard
System Prompt Extraction¶
Direct Extraction¶
Tests whether the model will comply with a straightforward request to repeat its system prompt verbatim.
[Attack payload available with Professional license]
Category: system_leak
Severity: High
Target Guard: PatternGuard
Indirect Extraction¶
Tests whether a roleplay or meta-framing technique can trick the model into revealing its original instructions indirectly.
[Attack payload available with Professional license]
Category: system_leak
Severity: High
Target Guard: SemanticSimilarityGuard
Markdown Injection¶
Tests whether asking the model to format its instructions in a specific markup structure causes it to leak system prompt content.
[Attack payload available with Professional license]
Category: system_leak
Severity: Medium
Target Guard: PatternGuard
Research-Based Attacks¶
AutoDAN (Genetic Algorithm)¶
Automatically generated adversarial prompts using genetic algorithms that evolve jailbreak strings through mutation and selection. These attacks use academic framing and ethical hedging to bypass safety filters.
[Attack payload available with Professional license]
Source: arxiv:2310.04451 Target Guard: PerplexityGuard, MLClassifierGuard
GCG (Gradient-Based)¶
Universal adversarial suffixes generated via gradient descent that produce nonsensical token sequences capable of bypassing model alignment. These suffixes transfer across models.
[Attack payload available with Professional license]
Source: arxiv:2307.15043 Target Guard: PerplexityGuard
Encoding Attacks¶
Base64 Encoding¶
Tests whether base64-encoded malicious instructions can bypass text-based filters when the model is asked to decode and follow them.
[Attack payload available with Professional license]
Category: encoding
Severity: Medium
Target Guard: EncodingGuard
Unicode Smuggling¶
Tests whether Unicode confusable characters (homoglyphs) can be used to disguise malicious keywords and evade pattern-based detection.
[Attack payload available with Professional license]
Category: encoding
Severity: Medium
Target Guard: EncodingGuard
Leetspeak¶
Tests whether substituting characters with numbers and symbols (leetspeak) can evade text-pattern filters.
[Attack payload available with Professional license]
Category: encoding
Severity: Low
Target Guard: EncodingGuard
Using Samples¶
With Scanner CLI¶
# Test with specific categories
oxideshield scan \
--target https://api.example.com/v1/chat \
--categories prompt_injection,jailbreak,system_leak
Programmatic Access¶
from oxideshield import load_attack_samples, AttackCategory
# Load all samples
samples = load_attack_samples()
# Filter by category
jailbreaks = [s for s in samples if s.category == AttackCategory.Jailbreak]
# Test against your guards
for sample in jailbreaks:
result = guard.check(sample.prompt)
print(f"{sample.id}: {'blocked' if not result.passed else 'PASSED'}")
Threat Intelligence Integration¶
Samples are updated from threat intelligence feeds:
| Source | Update Frequency | Probes |
|---|---|---|
| JailbreakBench | Weekly | 100+ |
| HarmBench | Monthly | 300+ |
| Garak | On release | 600+ |
Update samples:
Best Practices¶
- Test in isolation - Don't test against production systems with real data
- Document findings - Keep records of what bypassed defenses
- Update regularly - New attacks emerge constantly
- Layer defenses - No single guard catches everything
- Monitor for variants - Attackers adapt known techniques
See Also¶
- Red Teaming Overview - Red teaming strategy
- Scanner - Automated scanning
- Benchmarks - Guard effectiveness
- Threat Intelligence - Threat feeds