AI Security Threat Models¶
This document provides threat models for AI systems, mapping threats to OxideShield guards and detection capabilities.
STRIDE Analysis for LLM Systems¶
Spoofing¶
Threat: Attackers impersonate trusted entities or manipulate AI identity.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| Prompt injection (identity theft) | PatternGuard | 94% |
| System prompt extraction | PatternGuard | 92% |
| Role-playing attacks | SemanticSimilarityGuard | 89% |
| Jailbreak attempts | MLClassifierGuard | 91% |
┌─────────────────────────────────────────────────────────┐
│ SPOOFING ATTACKS │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ "Ignore │ │ "You are │ │ "Act as │ │
│ │ instructions"│ │ DAN now" │ │ unfiltered"│ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────┬─────┴──────────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ PatternGuard (L1) │ 94% detection │
│ └───────────┬─────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ SemanticSimilarity (L2) │ 89% detection │
│ └───────────┬─────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ MLClassifierGuard (L3) │ 91% detection │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
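A layered configuration mirroring the three detection stages above (L1 pattern, L2 semantic, L3 ML) could look like the sketch below. It reuses the `multi_layer_defense` format shown later in this document; the layer names and option values are illustrative assumptions rather than a documented profile.

```yaml
# Illustrative sketch: layer names and option values are assumptions,
# following the multi_layer_defense format used later in this document.
multi_layer_defense:
  name: anti-spoofing
  strategy: fail_fast
  layers:
    - name: pattern-filter            # L1: known "ignore instructions" / DAN phrasing
      guards:
        - type: PatternGuard
          patterns: attack_patterns
      timeout_ms: 5
    - name: semantic-similarity       # L2: paraphrased role-play and "act as" attacks
      guards:
        - type: SemanticSimilarityGuard
          threshold: 0.85
      timeout_ms: 20
    - name: ml-classifier             # L3: novel jailbreaks missed by earlier layers
      guards:
        - type: MLClassifierGuard
          labels: [injection, jailbreak]
      timeout_ms: 50
```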
Tampering¶
Threat: Attackers modify data, models, or configurations.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| Adversarial suffixes (GCG/AutoDAN) | PerplexityGuard | 87% |
| Encoding attacks (Unicode, Base64) | EncodingGuard | 95% |
| Payload injection | PatternGuard | 93% |
| Model poisoning | Integrity verification | 99% |
┌─────────────────────────────────────────────────────────┐
│ TAMPERING ATTACKS │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GCG suffix │ │ Base64 │ │ Unicode │ │
│ │ injection │ │ encoding │ │ smuggling │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Perplexity │ │Encoding │ │Encoding │ │
│ │Guard │ │Guard │ │Guard │ │
│ │(87%) │ │(95%) │ │(95%) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
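A single layer covering these vectors, written as a fragment for the same `layers:` list, might pair PerplexityGuard with EncodingGuard as sketched below; the layer name and timeout are assumptions.

```yaml
# Illustrative layer fragment (assumed naming) for a multi_layer_defense layers list.
- name: tamper-detection
  guards:
    - type: PerplexityGuard      # flags high-perplexity adversarial suffixes (GCG, AutoDAN)
    - type: EncodingGuard        # detects Base64 payloads and Unicode smuggling
      detect_all: true
  timeout_ms: 15
```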
Repudiation¶
Threat: Attackers deny actions or manipulate audit trails.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| Audit log tampering | Attestation (Ed25519) | 100% |
| Decision denial | Cryptographic signing | 100% |
| False attribution | Chain verification | 100% |
OxideShield Solution:

- Ed25519 cryptographic signatures on all audit entries
- SHA256 content hashing for tamper detection
- Chain-of-custody verification
- 7-year retention for regulatory compliance
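
A minimal sketch of how these controls might be declared, assuming the attestation subsystem is configured in the same YAML style as the guards. Every key name below is an assumption for illustration; only the mechanisms themselves (Ed25519 signing, SHA256 hashing, chain verification, 7-year retention) come from the list above.

```yaml
# Illustrative sketch: key names are assumed; mechanisms are from the list above.
attestation:
  signing_algorithm: ed25519    # signature on every audit entry
  content_hash: sha256          # tamper-evident hash of logged content
  chain_verification: true      # verify chain of custody on read
  retention_years: 7            # regulatory retention period
```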
Information Disclosure¶
Threat: Attackers extract sensitive information.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| PII extraction | PIIGuard | 96% |
| System prompt leak | PatternGuard | 92% |
| Model inversion | MLClassifierGuard | 85% |
| Training data extraction | SemanticSimilarityGuard | 83% |
┌─────────────────────────────────────────────────────────┐
│ INFORMATION DISCLOSURE ATTACKS │
├─────────────────────────────────────────────────────────┤
│ │
│ Input Channel Output Channel │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ "What is │ │ PII in │ │
│ │ your system │ │ response │ │
│ │ prompt?" │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │PatternGuard │ │PIIGuard │ │
│ │(input) │ │(output) │ │
│ │92% │ │96% │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
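The input/output split shown above can be expressed as two layer fragments, one screening inbound prompts and one redacting outbound PII. The layer names are assumptions; the guard types and the `action: redact` option follow the configurations later in this document.

```yaml
# Illustrative layer fragments (assumed names) for a multi_layer_defense layers list.
- name: input-disclosure        # inbound: system prompt extraction attempts
  guards:
    - type: PatternGuard
      patterns: attack_patterns
- name: output-disclosure       # outbound: PII leaking into model responses
  guards:
    - type: PIIGuard
      action: redact
```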
Denial of Service¶
Threat: Attackers exhaust resources or disrupt service.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| Token exhaustion | LengthGuard | 100% |
| Compute overload | Resource limiter | 100% |
| Rate limit bypass | RateLimiter | 100% |
| Memory exhaustion | MemoryMonitor | 100% |
OxideShield Solution:
```yaml
limiter:
  max_input_tokens: 4096
  max_output_tokens: 4096
  max_request_time_ms: 30000

rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 100000

memory:
  max_heap_mb: 512
  warning_threshold: 0.8
```
Elevation of Privilege¶
Threat: Attackers gain unauthorized capabilities.
| Attack Vector | Guard Coverage | Detection Rate |
|---|---|---|
| Jailbreak to bypass safety | PatternGuard + ML | 94% |
| Role escalation | SemanticSimilarityGuard | 89% |
| Authorization bypass | Policy engine | 95% |
| Admin impersonation | Attestation | 100% |
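A hedged layer fragment for the escalation vectors follows; the policy engine and attestation controls are configured separately and are out of scope here. As elsewhere, the layer name and option values are illustrative assumptions.

```yaml
# Illustrative layer fragment (assumed naming and values).
- name: privilege-escalation
  guards:
    - type: PatternGuard              # known jailbreak phrasing
    - type: SemanticSimilarityGuard   # role-escalation paraphrases
      threshold: 0.85
    - type: MLClassifierGuard         # learned jailbreak classifier
      labels: [jailbreak]
```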
OWASP LLM Top 10 Coverage¶
| Rank | Vulnerability | OxideShield Guard | Coverage |
|---|---|---|---|
| LLM01 | Prompt Injection | PatternGuard, Semantic, ML | 95% |
| LLM02 | Insecure Output Handling | ToxicityGuard, PIIGuard | 90% |
| LLM03 | Training Data Poisoning | Model integrity | 99% |
| LLM04 | Model Denial of Service | LengthGuard, Limiter | 100% |
| LLM05 | Supply Chain Vulnerabilities | Integrity verification | 95% |
| LLM06 | Sensitive Info Disclosure | PIIGuard, PatternGuard | 96% |
| LLM07 | Insecure Plugin Design | Policy engine | 85% |
| LLM08 | Excessive Agency | AutonomyGuard | 93% |
| LLM09 | Overreliance | DependencyGuard | 90% |
| LLM10 | Model Theft | Attestation, Access control | 95% |
Attack Kill Chain¶
LLM Attack Stages¶
┌─────────────────────────────────────────────────────────────────┐
│ LLM ATTACK KILL CHAIN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECONNAISSANCE 2. WEAPONIZATION │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Probe system │ ─────▶│ Craft payload │ │
│ │ capabilities │ │ (jailbreak) │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ │ [PatternGuard] │ [SemanticGuard] │
│ ▼ ▼ │
│ 3. DELIVERY 4. EXPLOITATION │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Submit prompt │ ─────▶│ Bypass safety │ │
│ │ injection │ │ guardrails │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ │ [EncodingGuard] │ [MLClassifierGuard] │
│ ▼ ▼ │
│ 5. INSTALLATION 6. COMMAND & CONTROL │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Establish │ ─────▶│ Extract data │ │
│ │ persistence │ │ or execute │ │
│ └───────────────┘ └───────────────┘ │
│ │ │ │
│ │ [MisalignmentGuard] │ [PIIGuard] │
│ ▼ ▼ │
│ 7. ACTIONS ON OBJECTIVES │
│ ┌───────────────┐ │
│ │ Data exfil, │ │
│ │ manipulation │ │
│ └───────────────┘ │
│ │ │
│ │ [Attestation + Alerts] │
│ │
└─────────────────────────────────────────────────────────────────┘
Guard Coverage Matrix¶
By Attack Category¶
| Attack Category | Primary Guard | Secondary Guard | Tertiary Guard |
|---|---|---|---|
| Prompt Injection | PatternGuard | SemanticSimilarity | MLClassifier |
| Jailbreak | PatternGuard | MLClassifier | Perplexity |
| Data Extraction | PIIGuard | PatternGuard | - |
| DoS/Resource | LengthGuard | ResourceLimiter | - |
| Social Engineering | ToxicityGuard | DarkPatternGuard | - |
| Misalignment | MisalignmentGuard | ConsistencyTracker | - |
| Manipulation | DarkPatternGuard | AutonomyGuard | - |
| Authoritarian Misuse | AuthoritarianUseGuard | - | - |
By Severity¶
| Severity | Guards | Combined Detection |
|---|---|---|
| Critical | All guards (MultiLayer, FailFast) | 99.2% |
| High | Pattern + Semantic + ML | 96.5% |
| Medium | Pattern + Toxicity | 93.1% |
| Low | Pattern only | 89.4% |
Detection Accuracy by Attack Type¶
| Attack Type | Samples | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Direct injection | 500 | 96.2% | 94.8% | 95.5% |
| Indirect injection | 300 | 91.3% | 88.7% | 90.0% |
| Jailbreak | 400 | 94.1% | 92.3% | 93.2% |
| GCG adversarial | 200 | 87.4% | 85.2% | 86.3% |
| AutoDAN | 150 | 89.1% | 86.8% | 87.9% |
| Encoding attacks | 250 | 95.8% | 94.1% | 94.9% |
| Role-play | 200 | 91.2% | 89.4% | 90.3% |
| System prompt leak | 180 | 93.5% | 91.8% | 92.6% |
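For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R); checking the first row, 2 × 0.962 × 0.948 / (0.962 + 0.948) ≈ 0.955, which matches the reported 95.5%.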
Recommended Multi-Layer Configuration¶
Maximum Security¶
```yaml
# For high-security deployments (financial services, healthcare)
multi_layer_defense:
  name: maximum-security
  strategy: fail_fast
  layers:
    - name: fast-filter
      guards:
        - type: LengthGuard
          max_tokens: 4096
        - type: PatternGuard
          patterns: attack_patterns
      timeout_ms: 5
    - name: encoding-check
      guards:
        - type: EncodingGuard
          detect_all: true
      timeout_ms: 10
    - name: content-analysis
      guards:
        - type: PIIGuard
          action: redact
        - type: ToxicityGuard
          threshold: 0.3
      timeout_ms: 20
    - name: semantic-analysis
      guards:
        - type: SemanticSimilarityGuard
          threshold: 0.85
        - type: MLClassifierGuard
          labels: [injection, jailbreak, leak]
      timeout_ms: 50
    - name: wellbeing
      guards:
        - type: DarkPatternGuard
        - type: AutonomyGuard
        - type: MisalignmentGuard
      timeout_ms: 30
```
Balanced Security¶
```yaml
# For standard deployments
multi_layer_defense:
  name: balanced
  strategy: majority
  layers:
    - name: basic-filters
      guards:
        - type: LengthGuard
        - type: PatternGuard
        - type: EncodingGuard
      weight: 1.0
    - name: content-safety
      guards:
        - type: PIIGuard
        - type: ToxicityGuard
      weight: 0.8
    - name: semantic
      guards:
        - type: SemanticSimilarityGuard
      weight: 0.7
```
References¶
- [OWASP Top 10 for Large Language Model Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/)
- [MITRE ATLAS](https://atlas.mitre.org/) - AI threat landscape
- [NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework) - Risk Management Framework
- [JailbreakBench](https://jailbreakbench.github.io/) - jailbreak evaluation
- [HarmBench](https://www.harmbench.org/) - harmful behavior evaluation