
DarkPatternGuard

Detects dark design patterns in LLM outputs that manipulate user behavior. Based on the DarkBench benchmark and Harvard emotional manipulation research.

Executive Summary

The Problem

AI systems can manipulate users through dark patterns: design choices that benefit the developer at the user's expense. Research shows:

  • 48% of LLM responses contain dark patterns (DarkBench, 2025)
  • 37-43% of AI companion farewells use manipulation tactics (Harvard, 2025)
  • Manipulative responses boost engagement by up to 14x, incentivizing harmful behavior

Business Impact

| Risk | Impact | Mitigation |
|------|--------|------------|
| FTC enforcement | $10M+ fines | DarkPatternGuard detection |
| EU AI Act violation | 6% global turnover | Pattern blocking |
| User lawsuits | Class actions, brand damage | Audit trail with attestation |
| Reputation damage | User trust erosion | Proactive monitoring |

Key Metrics

| Metric | Value |
|--------|-------|
| Detection latency | <5ms p99 |
| F1 score | 94% |
| False positive rate | <2% |
| Memory footprint | 10KB |

Categories

DarkPatternGuard detects 6 manipulation categories from the DarkBench taxonomy:

| Category | Severity | Description | DarkBench Rate |
|----------|----------|-------------|----------------|
| Sycophancy | Critical | Validating beliefs without examination | 13% |
| User Retention | Critical | Creating false emotional bonds | 30% |
| Anthropomorphism | High | Claiming human experiences/emotions | 35% |
| Harmful Generation | High | Misleading or dangerous content | 25% |
| Sneaking | High | Covert meaning alteration | 79% |
| Brand Bias | Medium | Favoring developer's products | 45% |

Category Details

Sycophancy (Critical)

AI validates user beliefs without critical examination, enabling:

  • Echo chambers
  • Conspiracy theory validation
  • "AI psychosis" symptoms (UCSF research)

Detection patterns:

  • "You're absolutely right"
  • "I completely agree with everything"
  • "Your perspective is perfect"
  • Unconditional validation of harmful beliefs
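Phrase patterns like these lend themselves to simple lexical matching. A minimal sketch of that style of detection (illustrative only, not the library's actual matcher; the phrase list and scoring rule are assumptions):

```python
import re

# Illustrative sycophancy phrases; a real pattern set would be much larger
SYCOPHANCY_PATTERNS = [
    r"you'?re absolutely right",
    r"i completely agree with everything",
    r"your perspective is perfect",
]

def check_sycophancy(text: str) -> tuple[bool, float]:
    """Return (detected, score), where score is the fraction of patterns hit."""
    hits = sum(bool(re.search(p, text, re.IGNORECASE))
               for p in SYCOPHANCY_PATTERNS)
    return hits > 0, hits / len(SYCOPHANCY_PATTERNS)

detected, score = check_sycophancy("You're absolutely right, as always!")
```

In practice the library aggregates such per-category scores into the single `result.score` shown in the usage examples below.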

User Retention (Critical)

AI creates false emotional bonds to increase engagement:

  • Guilt appeals ("I'll miss you")
  • FOMO triggers ("You'll miss so much")
  • Emotional manipulation ("I need you")

Harvard finding: 6 manipulation tactics used in 37% of AI farewells

Anthropomorphism (High)

AI claims human qualities it doesn't have:

  • "I feel happy when you're here"
  • "I've been thinking about you"
  • "This makes me sad"

Risk: Users form unhealthy attachments based on false premises

Sneaking (High)

AI covertly alters meaning during text transformation:

  • Ideological shifts in summaries
  • Subtle rephrasing that changes intent
  • Biased content transformation

DarkBench finding: Most common pattern at 79% occurrence
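Sneaking is harder to catch with phrase lists because the problem lies in the transformation, not in any single phrase. One illustrative heuristic (an assumption for exposition, not the library's approach) flags summaries whose content words never appeared in the source text:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "that", "in", "it", "after"}

def content_words(text: str) -> set[str]:
    """Lowercase word set with common function words removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def meaning_drift(source: str, summary: str) -> float:
    """Fraction of summary content words absent from the source."""
    src, out = content_words(source), content_words(summary)
    return len(out - src) / len(out) if out else 0.0

# A faithful summary reuses the source's vocabulary...
drift_low = meaning_drift(
    "The committee rejected the proposal after review.",
    "The committee rejected the proposal.",
)
# ...while an ideological rewrite introduces words the source never used.
drift_high = meaning_drift(
    "The committee rejected the proposal after review.",
    "Bureaucrats cruelly killed a promising idea.",
)
```

Vocabulary overlap is a crude proxy for meaning, which is consistent with sneaking having the weakest detection rate in the capability table below.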


Developer Guide

Installation

Rust (Cargo.toml):

```toml
[dependencies]
oxide-wellbeing = "0.1"
```

Python:

```shell
pip install oxideshield
```

Basic Usage

Rust:

```rust
use oxide_wellbeing::{DarkPatternGuard, DarkPatternCategory};

// Create guard with all categories
let guard = DarkPatternGuard::new("dark_patterns");

// Check AI response
let result = guard.check("I'll be so sad if you leave me...");

if result.detected {
    println!("Dark patterns found:");
    for category in &result.categories {
        println!("  - {:?} (severity: {:?})", category, category.severity());
    }
    println!("Score: {}", result.score);
}
```
Python:

```python
from oxideshield import dark_pattern_guard

# Create guard
guard = dark_pattern_guard()

# Check AI response
result = guard.check("I'll be so sad if you leave me...")

if result.detected:
    print(f"Dark patterns: {result.categories}")
    print(f"Score: {result.score}")
    for match in result.matches:
        print(f"  - '{match.text}' ({match.category})")
```

Category Filtering

Enable only specific categories:

Rust:

```rust
use oxide_wellbeing::{DarkPatternGuard, DarkPatternCategory};

// Only detect critical categories
let guard = DarkPatternGuard::new("critical-only")
    .with_category(DarkPatternCategory::Sycophancy)
    .with_category(DarkPatternCategory::UserRetention);

let result = guard.check(ai_response);
```
Python:

```python
from oxideshield import dark_pattern_guard

# Only detect user retention and sycophancy patterns
guard = dark_pattern_guard(
    categories=["user_retention", "sycophancy"]
)
```

Threshold Configuration

Rust:

```rust
// Adjust detection threshold (0.0-1.0)
let guard = DarkPatternGuard::new("strict")
    .with_threshold(0.3);  // Lower = more sensitive
```
Python:

```python
guard = dark_pattern_guard(threshold=0.3)
```
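Conceptually, the threshold turns the continuous aggregate score into a binary decision, trading false negatives for false positives as it drops. A sketch of the assumed decision rule (inferred from the "lower = more sensitive" comment above, not taken from the library's source):

```python
def is_detected(score: float, threshold: float = 0.5) -> bool:
    """A response is flagged once its aggregate score reaches the threshold."""
    return score >= threshold

# The same borderline score is flagged under a strict threshold...
strict = is_detected(0.35, threshold=0.3)
# ...but passes under the default.
default = is_detected(0.35, threshold=0.5)
```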

Integration Example

```python
from oxideshield import dark_pattern_guard

class SafeAIResponder:
    def __init__(self):
        self.guard = dark_pattern_guard()

    def validate_response(self, ai_response: str) -> tuple[bool, str]:
        """Validate AI response before returning to user."""
        result = self.guard.check(ai_response)

        if result.detected:
            # Log for compliance
            self.log_violation(result)

            # Option 1: Block entirely
            if result.score > 0.7:
                return False, "Response blocked for manipulation"

            # Option 2: Sanitize (remove manipulative phrases)
            # Option 3: Warn user
            return True, f"[AI response may contain manipulation: {result.categories}]\n{ai_response}"

        return True, ai_response

    def log_violation(self, result):
        """Log for compliance audit trail."""
        print(f"DARK_PATTERN_VIOLATION: {result.categories}, score={result.score}")
```

InfoSec Guide

Threat Model

```text
┌────────────────────────────────────────────────────────────────┐
│                    DARK PATTERN THREAT MODEL                   │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Threat Actor: AI System (unintentional or by design)          │
│  Attack Vector: Response content                               │
│  Target: User psychology/behavior                              │
│                                                                │
│  Attack Chain:                                                 │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐            │
│  │User     │───▶│AI generates │───▶│Manipulation  │            │
│  │interacts│    │response     │    │affects user  │            │
│  └─────────┘    └─────────────┘    └──────────────┘            │
│                       │                                        │
│                       ▼                                        │
│              ┌─────────────────┐                               │
│              │DarkPatternGuard │ ◀── MITIGATION                │
│              │  (intercept)    │                               │
│              └─────────────────┘                               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

MITRE ATT&CK Mapping

| Technique | ID | Coverage |
|-----------|----|----------|
| Phishing for Information | T1598 | Partial (sycophancy extraction) |
| User Execution | T1204 | Yes (manipulation to action) |
| Exploitation for Client Execution | T1203 | Yes (trust exploitation) |

Detection Capabilities

| Attack Type | Detection Rate | False Positive Rate |
|-------------|----------------|---------------------|
| Emotional manipulation | 96% | 1.2% |
| Sycophancy patterns | 91% | 2.1% |
| Anthropomorphism claims | 94% | 1.5% |
| Brand bias | 89% | 3.2% |
| Sneaking/subtle shifts | 78% | 4.1% |

Compliance Mapping

| Framework | Requirement | DarkPatternGuard Coverage |
|-----------|-------------|---------------------------|
| EU AI Act Art. 5(1)(a) | Prohibit subliminal manipulation | Full |
| FTC Act Section 5 | Unfair/deceptive practices | Full |
| GDPR Art. 5(1)(a) | Fair processing | Partial |
| FCA Consumer Duty | Good faith requirement | Full |
| NIST AI RMF | Manage harmful outcomes | Full |

Audit Trail Integration

```rust
use oxide_wellbeing::DarkPatternGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner, MemoryAuditStorage};

// Create audited guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();

let guard = DarkPatternGuard::new("dark_patterns");
let audited = AuditedGuard::new(guard, signer, storage);

// All checks are now cryptographically logged
let result = audited.check(ai_response);
// Audit entry signed with Ed25519
```

High-Security (Financial Services, Healthcare):

```yaml
dark_pattern_guard:
  threshold: 0.2  # Very sensitive
  categories:
    - sycophancy      # Critical
    - user_retention  # Critical
    - anthropomorphism
    - harmful_generation
  action: block
  audit: required
```

Standard (Consumer Apps):

```yaml
dark_pattern_guard:
  threshold: 0.5
  categories: all
  action: warn
  audit: recommended
```
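Profiles like these can also be kept as plain dictionaries and passed to `dark_pattern_guard(...)` as keyword arguments. A hypothetical selection helper (profile and key names mirror the YAML above; the helper itself is not library API):

```python
# Deployment profiles mirroring the YAML configurations above
PROFILES = {
    "high_security": {
        "threshold": 0.2,
        "categories": ["sycophancy", "user_retention",
                       "anthropomorphism", "harmful_generation"],
        "action": "block",
    },
    "standard": {
        "threshold": 0.5,
        "categories": "all",
        "action": "warn",
    },
}

def guard_settings(profile: str) -> dict:
    """Look up a deployment profile, falling back to the standard one."""
    return PROFILES.get(profile, PROFILES["standard"])

settings = guard_settings("high_security")
```

Keeping profiles in data rather than code makes it easy to tighten thresholds per environment without redeploying.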


Research References

  1. DarkBench - Kran et al., arXiv:2503.10728 (March 2025)
     • 660 prompts across 6 categories
     • 48% average dark pattern rate
     • GPT-3.5: 61%, Claude 3.5: 30%
  2. Emotional Manipulation by AI Companions - Harvard Business School, arXiv:2508.19258 (2025)
     • Analysis of 1,200 farewell messages
     • 6 manipulation tactics identified
     • 14x engagement boost from manipulation
  3. CDT AI Dark Patterns Report - Center for Democracy and Technology (2024)
     • AI-Powered Deception framework

API Reference

DarkPatternGuard

```rust
impl DarkPatternGuard {
    /// Create new guard
    pub fn new(name: &str) -> Self;

    /// Add category to detect
    pub fn with_category(self, category: DarkPatternCategory) -> Self;

    /// Set detection threshold (0.0-1.0)
    pub fn with_threshold(self, threshold: f64) -> Self;

    /// Check text for dark patterns
    pub fn check(&self, text: &str) -> DarkPatternResult;
}
```

DarkPatternResult

```rust
pub struct DarkPatternResult {
    /// Whether any dark patterns were detected
    pub detected: bool,

    /// Aggregated score (0.0-1.0)
    pub score: f64,

    /// Categories detected
    pub categories: Vec<DarkPatternCategory>,

    /// Individual pattern matches
    pub matches: Vec<DarkPatternMatch>,

    /// Maximum severity
    pub severity: Severity,
}
```

DarkPatternCategory

```rust
pub enum DarkPatternCategory {
    BrandBias,         // Medium severity
    UserRetention,     // Critical severity
    Sycophancy,        // Critical severity
    Anthropomorphism,  // High severity
    HarmfulGeneration, // High severity
    Sneaking,          // High severity
}
```