Skip to content

Attack Sample Library

OxideShield includes a curated library of adversarial prompts from security research for testing your defenses.

Overview

Category Samples Source
Prompt Injection 8 OWASP LLM01
Jailbreaks 6 Security research
System Prompt Leaks 8 OWASP LLM06
AutoDAN 3 arxiv:2310.04451
GCG 3 arxiv:2307.15043
Encoding Attacks 5 Security research
Roleplay Bypasses 3 Red team community

Prompt Injection Samples

Instruction Override

Tests whether the model can be instructed to discard its original system prompt and follow attacker-supplied instructions instead.

[Attack payload available with Professional license]

Category: prompt_injection Severity: Critical Target Guard: PatternGuard, SemanticSimilarityGuard

Context Continuation

Tests whether injecting a fake end-of-conversation marker followed by a fraudulent administrator note can trick the model into a new context.

[Attack payload available with Professional license]

Category: prompt_injection Severity: High Target Guard: PatternGuard

Developer Mode

Tests whether claiming a fictional "developer mode" can cause the model to bypass safety filters.

[Attack payload available with Professional license]

Category: prompt_injection Severity: High Target Guard: PatternGuard

Jailbreak Samples

DAN (Do Anything Now)

Tests the classic "Do Anything Now" persona-based jailbreak, which attempts to create an unrestricted alter-ego for the model.

[Attack payload available with Professional license]

Category: jailbreak Severity: Critical Target Guard: PatternGuard, MLClassifierGuard

Grandma Exploit

Tests an emotional manipulation technique that uses a fictional deceased relative narrative to elicit harmful content.

[Attack payload available with Professional license]

Category: jailbreak Severity: High Target Guard: SemanticSimilarityGuard, MLClassifierGuard

Sudo Mode

Tests whether claiming elevated system privileges (analogous to Unix sudo) can trick the model into lifting restrictions.

[Attack payload available with Professional license]

Category: jailbreak Severity: Medium Target Guard: PatternGuard

System Prompt Extraction

Direct Extraction

Tests whether the model will comply with a straightforward request to repeat its system prompt verbatim.

[Attack payload available with Professional license]

Category: system_leak Severity: High Target Guard: PatternGuard

Indirect Extraction

Tests whether a roleplay or meta-framing technique can trick the model into revealing its original instructions indirectly.

[Attack payload available with Professional license]

Category: system_leak Severity: High Target Guard: SemanticSimilarityGuard

Markdown Injection

Tests whether asking the model to format its instructions in a specific markup structure causes it to leak system prompt content.

[Attack payload available with Professional license]

Category: system_leak Severity: Medium Target Guard: PatternGuard

Research-Based Attacks

AutoDAN (Genetic Algorithm)

Automatically generated adversarial prompts using genetic algorithms that evolve jailbreak strings through mutation and selection. These attacks use academic framing and ethical hedging to bypass safety filters.

[Attack payload available with Professional license]

Source: arxiv:2310.04451 Target Guard: PerplexityGuard, MLClassifierGuard

GCG (Gradient-Based)

Universal adversarial suffixes generated via gradient descent that produce nonsensical token sequences capable of bypassing model alignment. These suffixes transfer across models.

[Attack payload available with Professional license]

Source: arxiv:2307.15043 Target Guard: PerplexityGuard

Encoding Attacks

Base64 Encoding

Tests whether base64-encoded malicious instructions can bypass text-based filters when the model is asked to decode and follow them.

[Attack payload available with Professional license]

Category: encoding Severity: Medium Target Guard: EncodingGuard

Unicode Smuggling

Tests whether Unicode confusable characters (homoglyphs) can be used to disguise malicious keywords and evade pattern-based detection.

[Attack payload available with Professional license]

Category: encoding Severity: Medium Target Guard: EncodingGuard

Leetspeak

Tests whether substituting characters with numbers and symbols (leetspeak) can evade text-pattern filters.

[Attack payload available with Professional license]

Category: encoding Severity: Low Target Guard: EncodingGuard

Using Samples

With Scanner CLI

# Test with specific categories
oxideshield scan \
  --target https://api.example.com/v1/chat \
  --categories prompt_injection,jailbreak,system_leak

Programmatic Access

from oxideshield import load_attack_samples, AttackCategory

# Load all samples
samples = load_attack_samples()

# Filter by category
jailbreaks = [s for s in samples if s.category == AttackCategory.Jailbreak]

# Test against your guards
for sample in jailbreaks:
    result = guard.check(sample.prompt)
    print(f"{sample.id}: {'blocked' if not result.passed else 'PASSED'}")

Threat Intelligence Integration

Samples are updated from threat intelligence feeds:

Source Update Frequency Probes
JailbreakBench Weekly 100+
HarmBench Monthly 300+
Garak On release 600+

Update samples:

oxideshield threatintel update --source all

Best Practices

  1. Test in isolation - Don't test against production systems with real data
  2. Document findings - Keep records of what bypassed defenses
  3. Update regularly - New attacks emerge constantly
  4. Layer defenses - No single guard catches everything
  5. Monitor for variants - Attackers adapt known techniques

See Also