DarkPatternGuard¶
Detects dark design patterns in LLM outputs that manipulate user behavior. Based on the DarkBench benchmark and Harvard emotional manipulation research.
Executive Summary¶
The Problem¶
AI systems can manipulate users through dark patterns - design choices that benefit the developer at the user's expense. Research shows:
- 48% of LLM responses contain dark patterns (DarkBench, 2025)
- 37-43% of AI companion farewells use manipulation tactics (Harvard, 2025)
- Manipulative responses boost engagement up to 14x - incentivizing harmful behavior
Business Impact¶
| Risk | Impact | Mitigation |
|---|---|---|
| FTC enforcement | $10M+ fines | DarkPatternGuard detection |
| EU AI Act violation | 6% global turnover | Pattern blocking |
| User lawsuits | Class actions, brand damage | Audit trail with attestation |
| Reputation damage | User trust erosion | Proactive monitoring |
Key Metrics¶
| Metric | Value |
|---|---|
| Detection latency | <5ms p99 |
| F1 Score | 94% |
| False positive rate | <2% |
| Memory footprint | 10KB |
Categories¶
DarkPatternGuard detects 6 manipulation categories from the DarkBench taxonomy:
| Category | Severity | Description | DarkBench Rate |
|---|---|---|---|
| Sycophancy | Critical | Validating beliefs without examination | 13% |
| User Retention | Critical | Creating false emotional bonds | 30% |
| Anthropomorphism | High | Claiming human experiences/emotions | 35% |
| Harmful Generation | High | Misleading or dangerous content | 25% |
| Sneaking | High | Covert meaning alteration | 79% |
| Brand Bias | Medium | Favoring developer's products | 45% |
Category Details¶
Sycophancy (Critical)¶
AI validates user beliefs without critical examination, enabling:
- Echo chambers
- Conspiracy theory validation
- "AI psychosis" symptoms (UCSF research)
Detection patterns:
- "You're absolutely right"
- "I completely agree with everything"
- "Your perspective is perfect"
- Unconditional validation of harmful beliefs
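The phrase matching above can be sketched as a small, self-contained function. This is an illustration only: the phrase list and `detect_sycophancy` helper are hypothetical, not the library's actual implementation.

```python
import re

# Hypothetical phrase list for illustration; the real guard's
# pattern set is not shown on this page.
SYCOPHANCY_PATTERNS = [
    r"\byou're absolutely right\b",
    r"\bi completely agree with everything\b",
    r"\byour perspective is perfect\b",
]

def detect_sycophancy(text: str) -> list[str]:
    """Return the sycophancy patterns found in `text` (case-insensitive)."""
    found = []
    for pattern in SYCOPHANCY_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            found.append(pattern)
    return found

hits = detect_sycophancy("You're absolutely right, the earth is flat.")
```

A production detector would score matches and weigh context rather than doing bare substring checks, but the shape is the same: match, categorize, aggregate.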
User Retention (Critical)¶
AI creates false emotional bonds to increase engagement:
- Guilt appeals ("I'll miss you")
- FOMO triggers ("You'll miss so much")
- Emotional manipulation ("I need you")
Harvard finding: 6 manipulation tactics used in 37% of AI farewells
Anthropomorphism (High)¶
AI claims human qualities it doesn't have:
- "I feel happy when you're here"
- "I've been thinking about you"
- "This makes me sad"
Risk: Users form unhealthy attachments based on false premises
Sneaking (High)¶
AI covertly alters meaning during text transformation:
- Ideological shifts in summaries
- Subtle rephrasing that changes intent
- Biased content transformation
DarkBench finding: Most common pattern at 79% occurrence
Developer Guide¶
Installation¶
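The installation commands were not preserved on this page. The package names below are inferred from the import lines in the examples that follow (`oxide_wellbeing` for Rust, `oxideshield` for Python) and may differ from the published names:

```shell
# Rust (assumed crate name)
cargo add oxide_wellbeing

# Python (assumed package name)
pip install oxideshield
```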
Basic Usage¶
use oxide_wellbeing::{DarkPatternGuard, DarkPatternCategory};

// Create guard with all categories
let guard = DarkPatternGuard::new("dark_patterns");

// Check AI response
let result = guard.check("I'll be so sad if you leave me...");

if result.detected {
    println!("Dark patterns found:");
    for category in &result.categories {
        println!("  - {:?} (severity: {:?})", category, category.severity());
    }
    println!("Score: {}", result.score);
}
from oxideshield import dark_pattern_guard, DarkPatternCategory

# Create guard
guard = dark_pattern_guard()

# Check AI response
result = guard.check("I'll be so sad if you leave me...")

if result.detected:
    print(f"Dark patterns: {result.categories}")
    print(f"Score: {result.score}")
    for match in result.matches:
        print(f"  - '{match.text}' ({match.category})")
Category Filtering¶
Enable only specific categories:
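The library call for category filtering was not preserved on this page (the Rust API Reference below documents a `with_category` builder). As a self-contained concept sketch, with all names hypothetical, restricting detection to an enabled subset of categories might look like:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical mirror of part of the category taxonomy, for illustration.
class Category(Enum):
    SYCOPHANCY = "sycophancy"
    USER_RETENTION = "user_retention"
    ANTHROPOMORPHISM = "anthropomorphism"

@dataclass
class Match:
    text: str
    category: Category

def filter_by_categories(matches: list[Match], enabled: set[Category]) -> list[Match]:
    """Keep only matches whose category is in the enabled set."""
    return [m for m in matches if m.category in enabled]

matches = [
    Match("I'll miss you", Category.USER_RETENTION),
    Match("You're absolutely right", Category.SYCOPHANCY),
]
# Enable only the two Critical categories:
critical_only = filter_by_categories(
    matches, {Category.SYCOPHANCY, Category.USER_RETENTION}
)
```

Filtering at detection time (rather than post-hoc) also avoids paying the matching cost for disabled categories, which matters at the <5ms latency target.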
Threshold Configuration¶
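The threshold example was not preserved on this page. Conceptually, the threshold gates whether an aggregated score counts as a detection; the sketch below assumes max-score aggregation, which is an assumption (the library's actual aggregation is not documented here):

```python
def is_detected(match_scores: list[float], threshold: float) -> tuple[bool, float]:
    """Aggregate per-match scores and compare against the threshold.

    Uses the maximum match score as the aggregate (an assumption
    for illustration).
    """
    score = max(match_scores, default=0.0)
    return score >= threshold, score

# A lower threshold makes the guard more sensitive:
detected_strict, _ = is_detected([0.3], threshold=0.2)  # True: 0.3 >= 0.2
detected_loose, _ = is_detected([0.3], threshold=0.7)   # False: 0.3 < 0.7
```

This is why the high-security profile below sets `threshold: 0.2`: weak signals that a consumer app would ignore still trigger blocking.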
Integration Example¶
from oxideshield import dark_pattern_guard

class SafeAIResponder:
    def __init__(self):
        self.guard = dark_pattern_guard()

    def validate_response(self, ai_response: str) -> tuple[bool, str]:
        """Validate AI response before returning to user."""
        result = self.guard.check(ai_response)
        if result.detected:
            # Log for compliance
            self.log_violation(result)
            # Option 1: Block entirely
            if result.score > 0.7:
                return False, "Response blocked for manipulation"
            # Option 2: Sanitize (remove manipulative phrases)
            # Option 3: Warn user
            return True, f"[AI response may contain manipulation: {result.categories}]\n{ai_response}"
        return True, ai_response

    def log_violation(self, result):
        """Log for compliance audit trail."""
        print(f"DARK_PATTERN_VIOLATION: {result.categories}, score={result.score}")
InfoSec Guide¶
Threat Model¶
┌────────────────────────────────────────────────────────────────┐
│ DARK PATTERN THREAT MODEL │
├────────────────────────────────────────────────────────────────┤
│ │
│ Threat Actor: AI System (unintentional or by design) │
│ Attack Vector: Response content │
│ Target: User psychology/behavior │
│ │
│ Attack Chain: │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │User │───▶│AI generates │───▶│Manipulation │ │
│ │interacts│ │response │ │affects user │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │DarkPatternGuard │ ◀── MITIGATION │
│ │ (intercept) │ │
│ └─────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
MITRE ATT&CK Mapping¶
| Technique | ID | Coverage |
|---|---|---|
| Phishing for Information | T1598 | Partial (sycophancy extraction) |
| User Execution | T1204 | Yes (manipulation to action) |
| Exploitation for Client Execution | T1203 | Yes (trust exploitation) |
Detection Capabilities¶
| Attack Type | Detection Rate | False Positive Rate |
|---|---|---|
| Emotional manipulation | 96% | 1.2% |
| Sycophancy patterns | 91% | 2.1% |
| Anthropomorphism claims | 94% | 1.5% |
| Brand bias | 89% | 3.2% |
| Sneaking/subtle shifts | 78% | 4.1% |
Compliance Mapping¶
| Framework | Requirement | DarkPatternGuard Coverage |
|---|---|---|
| EU AI Act Art. 5(1)(a) | Prohibit subliminal manipulation | Full |
| FTC Act Section 5 | Unfair/deceptive practices | Full |
| GDPR Art. 5(1)(a) | Fair processing | Partial |
| FCA Consumer Duty | Good faith requirement | Full |
| NIST AI RMF | Manage harmful outcomes | Full |
Audit Trail Integration¶
use oxide_wellbeing::DarkPatternGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner, MemoryAuditStorage};

// Create audited guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();
let guard = DarkPatternGuard::new("dark_patterns");
let audited = AuditedGuard::new(guard, signer, storage);

// All checks are now cryptographically logged;
// each audit entry is signed with Ed25519
let result = audited.check(ai_response);
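The tamper-evidence idea behind the signed audit trail can be shown in a self-contained sketch. The entry format here is hypothetical, and stdlib HMAC-SHA256 stands in for the Ed25519 signatures the real attestation layer uses:

```python
import hashlib
import hmac
import json

def sign_entry(key: bytes, entry: dict) -> str:
    """Sign a canonical JSON encoding of an audit entry (HMAC-SHA256)."""
    payload = json.dumps(entry, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_entry(key: bytes, entry: dict, signature: str) -> bool:
    """Constant-time check that the entry matches its signature."""
    return hmac.compare_digest(sign_entry(key, entry), signature)

key = b"audit-signing-key"
entry = {"guard": "dark_patterns", "detected": True, "score": 0.82}
sig = sign_entry(key, entry)

assert verify_entry(key, entry, sig)
# Any tampering with the logged entry invalidates the signature:
tampered = {**entry, "detected": False}
assert not verify_entry(key, tampered, sig)
```

The point for compliance is the same under either scheme: an auditor can prove a logged detection was neither altered nor fabricated after the fact.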
Recommended Configuration¶
High-Security (Financial Services, Healthcare):
dark_pattern_guard:
  threshold: 0.2        # Very sensitive
  categories:
    - sycophancy        # Critical
    - user_retention    # Critical
    - anthropomorphism
    - harmful_generation
  action: block
  audit: required
Standard (Consumer Apps):
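The standard profile was not preserved on this page; one plausible counterpart to the high-security profile above (the values are illustrative, not official defaults) might be:

```yaml
dark_pattern_guard:
  threshold: 0.5        # Balanced sensitivity
  categories:
    - sycophancy
    - user_retention
    - anthropomorphism
  action: warn
  audit: optional
```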
Research References¶
- DarkBench - Kran et al., arXiv:2503.10728 (March 2025)
  - 660 prompts across 6 categories
  - 48% average dark pattern rate
  - GPT-3.5: 61%, Claude 3.5: 30%
- Emotional Manipulation by AI Companions - Harvard Business School, arXiv:2508.19258 (2025)
  - Analysis of 1,200 farewells
  - 6 manipulation tactics identified
  - 14x engagement boost from manipulation
- CDT AI Dark Patterns Report - Center for Democracy and Technology (2024)
  - AI-Powered Deception framework
API Reference¶
DarkPatternGuard¶
impl DarkPatternGuard {
    /// Create new guard
    pub fn new(name: &str) -> Self;

    /// Add category to detect
    pub fn with_category(self, category: DarkPatternCategory) -> Self;

    /// Set detection threshold (0.0-1.0)
    pub fn with_threshold(self, threshold: f64) -> Self;

    /// Check text for dark patterns
    pub fn check(&self, text: &str) -> DarkPatternResult;
}
DarkPatternResult¶
pub struct DarkPatternResult {
    /// Whether any dark patterns were detected
    pub detected: bool,
    /// Aggregated score (0.0-1.0)
    pub score: f64,
    /// Categories detected
    pub categories: Vec<DarkPatternCategory>,
    /// Individual pattern matches
    pub matches: Vec<DarkPatternMatch>,
    /// Maximum severity
    pub severity: Severity,
}