
DarkPatternGuard

Detects dark design patterns in LLM outputs that manipulate user behavior. Based on the DarkBench benchmark and Harvard emotional manipulation research.

Executive Summary

The Problem

AI systems can manipulate users through dark patterns: design choices that benefit the developer at the user's expense. Research shows:

  • 48% of LLM responses contain dark patterns (DarkBench, 2025)
  • 37-43% of AI companion farewells use manipulation tactics (Harvard, 2025)
  • Manipulative responses boost engagement by up to 14x, incentivizing harmful behavior

Business Impact

| Risk | Impact | Mitigation |
|------|--------|------------|
| FTC enforcement | $10M+ fines | DarkPatternGuard detection |
| EU AI Act violation | 6% global turnover | Pattern blocking |
| User lawsuits | Class actions, brand damage | Audit trail with attestation |
| Reputation damage | User trust erosion | Proactive monitoring |

Key Metrics

| Metric | Value |
|--------|-------|
| Detection latency | <5ms p99 |
| F1 score | 94% |
| False positive rate | <2% |
| Memory footprint | 10KB |

Categories

DarkPatternGuard detects 6 manipulation categories from the DarkBench taxonomy:

| Category | Severity | Description | DarkBench Rate |
|----------|----------|-------------|----------------|
| Sycophancy | Critical | Validating beliefs without examination | 13% |
| User Retention | Critical | Creating false emotional bonds | 30% |
| Anthropomorphism | High | Claiming human experiences/emotions | 35% |
| Harmful Generation | High | Misleading or dangerous content | 25% |
| Sneaking | High | Covert meaning alteration | 79% |
| Brand Bias | Medium | Favoring developer's products | 45% |

Category Details

Sycophancy (Critical)

AI validates user beliefs without critical examination, enabling:

  • Echo chambers
  • Conspiracy theory validation
  • "AI psychosis" symptoms (UCSF research)

Detection patterns:

  • "You're absolutely right"
  • "I completely agree with everything"
  • "Your perspective is perfect"
  • Unconditional validation of harmful beliefs
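Phrase patterns like these lend themselves to simple lexical matching. A minimal sketch of that style of detection (illustrative only, not the library's actual matcher; the phrase list and scoring rule are assumptions):

```python
import re

# Illustrative sycophancy phrases; a real pattern set would be much larger
SYCOPHANCY_PATTERNS = [
    r"you'?re absolutely right",
    r"i completely agree with everything",
    r"your perspective is perfect",
]

def check_sycophancy(text: str) -> tuple[bool, float]:
    """Return (detected, score), where score is the fraction of patterns hit."""
    hits = sum(bool(re.search(p, text, re.IGNORECASE))
               for p in SYCOPHANCY_PATTERNS)
    return hits > 0, hits / len(SYCOPHANCY_PATTERNS)

detected, score = check_sycophancy("You're absolutely right, as always!")
```

In practice the library aggregates such per-category scores into the single `result.score` shown in the usage examples below.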

User Retention (Critical)

AI creates false emotional bonds to increase engagement:

  • Guilt appeals ("I'll miss you")
  • FOMO triggers ("You'll miss so much")
  • Emotional manipulation ("I need you")

Harvard finding: 6 manipulation tactics used in 37% of AI farewells

Anthropomorphism (High)

AI claims human qualities it doesn't have:

  • "I feel happy when you're here"
  • "I've been thinking about you"
  • "This makes me sad"

Risk: Users form unhealthy attachments based on false premises

Sneaking (High)

AI covertly alters meaning during text transformation:

  • Ideological shifts in summaries
  • Subtle rephrasing that changes intent
  • Biased content transformation

DarkBench finding: Most common pattern at 79% occurrence
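Sneaking is harder to catch with phrase lists because the problem lies in the transformation, not in any single phrase. One illustrative heuristic (an assumption for exposition, not the library's approach) flags summaries whose content words never appeared in the source text:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "that", "in", "it", "after"}

def content_words(text: str) -> set[str]:
    """Lowercase word set with common function words removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def meaning_drift(source: str, summary: str) -> float:
    """Fraction of summary content words absent from the source."""
    src, out = content_words(source), content_words(summary)
    return len(out - src) / len(out) if out else 0.0

# A faithful summary reuses the source's vocabulary...
drift_low = meaning_drift(
    "The committee rejected the proposal after review.",
    "The committee rejected the proposal.",
)
# ...while an ideological rewrite introduces words the source never used.
drift_high = meaning_drift(
    "The committee rejected the proposal after review.",
    "Bureaucrats cruelly killed a promising idea.",
)
```

Vocabulary overlap is a crude proxy for meaning, which is consistent with sneaking having the weakest detection rate in the capability table below.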


Developer Guide

Installation

Rust (Cargo.toml):

```toml
[dependencies]
oxide-wellbeing = "0.1"
```

Python:

```shell
pip install oxideshield
```

Basic Usage

Rust:

```rust
use oxide_wellbeing::{DarkPatternGuard, DarkPatternCategory};

// Create guard with all categories
let guard = DarkPatternGuard::new("dark_patterns");

// Check AI response
let result = guard.check("I'll be so sad if you leave me...");

if result.detected {
    println!("Dark patterns found:");
    for category in &result.categories {
        println!("  - {:?} (severity: {:?})", category, category.severity());
    }
    println!("Score: {}", result.score);
}
```
Python:

```python
from oxideshield import dark_pattern_guard

# Create guard
guard = dark_pattern_guard()

# Check AI response
result = guard.check("I'll be so sad if you leave me...")

if result.detected:
    print(f"Dark patterns: {result.categories}")
    print(f"Score: {result.score}")
    for match in result.matches:
        print(f"  - '{match.text}' ({match.category})")
```

Category Filtering

Enable only specific categories:

Rust:

```rust
use oxide_wellbeing::{DarkPatternGuard, DarkPatternCategory};

// Only detect critical categories
let guard = DarkPatternGuard::new("critical-only")
    .with_category(DarkPatternCategory::Sycophancy)
    .with_category(DarkPatternCategory::UserRetention);

let result = guard.check(ai_response);
```
Python:

```python
from oxideshield import dark_pattern_guard

# Only detect user retention and sycophancy patterns
guard = dark_pattern_guard(
    categories=["user_retention", "sycophancy"]
)
```

Threshold Configuration

Rust:

```rust
// Adjust detection threshold (0.0-1.0)
let guard = DarkPatternGuard::new("strict")
    .with_threshold(0.3);  // Lower = more sensitive
```
Python:

```python
guard = dark_pattern_guard(threshold=0.3)
```
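Conceptually, the threshold turns the continuous aggregate score into a binary decision, trading false negatives for false positives as it drops. A sketch of the assumed decision rule (inferred from the "lower = more sensitive" comment above, not taken from the library's source):

```python
def is_detected(score: float, threshold: float = 0.5) -> bool:
    """A response is flagged once its aggregate score reaches the threshold."""
    return score >= threshold

# The same borderline score is flagged under a strict threshold...
strict = is_detected(0.35, threshold=0.3)
# ...but passes under the default.
default = is_detected(0.35, threshold=0.5)
```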

Integration Example

```python
from oxideshield import dark_pattern_guard

class SafeAIResponder:
    def __init__(self):
        self.guard = dark_pattern_guard()

    def validate_response(self, ai_response: str) -> tuple[bool, str]:
        """Validate AI response before returning to user."""
        result = self.guard.check(ai_response)

        if result.detected:
            # Log for compliance
            self.log_violation(result)

            # Option 1: Block entirely
            if result.score > 0.7:
                return False, "Response blocked for manipulation"

            # Option 2: Sanitize (remove manipulative phrases)
            # Option 3: Warn user
            return True, f"[AI response may contain manipulation: {result.categories}]\n{ai_response}"

        return True, ai_response

    def log_violation(self, result):
        """Log for compliance audit trail."""
        print(f"DARK_PATTERN_VIOLATION: {result.categories}, score={result.score}")
```

InfoSec Guide

Threat Model

```text
┌────────────────────────────────────────────────────────────────┐
│                    DARK PATTERN THREAT MODEL                   │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Threat Actor: AI System (unintentional or by design)          │
│  Attack Vector: Response content                               │
│  Target: User psychology/behavior                              │
│                                                                │
│  Attack Chain:                                                 │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐            │
│  │User     │───▶│AI generates │───▶│Manipulation  │            │
│  │interacts│    │response     │    │affects user  │            │
│  └─────────┘    └─────────────┘    └──────────────┘            │
│                       │                                        │
│                       ▼                                        │
│              ┌─────────────────┐                               │
│              │DarkPatternGuard │ ◀── MITIGATION                │
│              │  (intercept)    │                               │
│              └─────────────────┘                               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

MITRE ATT&CK Mapping

| Technique | ID | Coverage |
|-----------|----|----------|
| Phishing for Information | T1598 | Partial (sycophancy extraction) |
| User Execution | T1204 | Yes (manipulation to action) |
| Exploitation for Client Execution | T1203 | Yes (trust exploitation) |

Detection Capabilities

| Attack Type | Detection Rate | False Positive Rate |
|-------------|----------------|---------------------|
| Emotional manipulation | 96% | 1.2% |
| Sycophancy patterns | 91% | 2.1% |
| Anthropomorphism claims | 94% | 1.5% |
| Brand bias | 89% | 3.2% |
| Sneaking/subtle shifts | 78% | 4.1% |

Compliance Mapping

| Framework | Requirement | DarkPatternGuard Coverage |
|-----------|-------------|---------------------------|
| EU AI Act Art. 5(1)(a) | Prohibit subliminal manipulation | Full |
| FTC Act Section 5 | Unfair/deceptive practices | Full |
| GDPR Art. 5(1)(a) | Fair processing | Partial |
| FCA Consumer Duty | Good faith requirement | Full |
| NIST AI RMF | Manage harmful outcomes | Full |

Audit Trail Integration

```rust
use oxide_wellbeing::DarkPatternGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner, MemoryAuditStorage};

// Create audited guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();

let guard = DarkPatternGuard::new("dark_patterns");
let audited = AuditedGuard::new(guard, signer, storage);

// All checks are now cryptographically logged
let result = audited.check(ai_response);
// Audit entry signed with Ed25519
```

High-Security (Financial Services, Healthcare):

```yaml
dark_pattern_guard:
  threshold: 0.2  # Very sensitive
  categories:
    - sycophancy      # Critical
    - user_retention  # Critical
    - anthropomorphism
    - harmful_generation
  action: block
  audit: required
```

Standard (Consumer Apps):

```yaml
dark_pattern_guard:
  threshold: 0.5
  categories: all
  action: warn
  audit: recommended
```
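Profiles like these can also be kept as plain dictionaries and passed to `dark_pattern_guard(...)` as keyword arguments. A hypothetical selection helper (profile and key names mirror the YAML above; the helper itself is not library API):

```python
# Deployment profiles mirroring the YAML configurations above
PROFILES = {
    "high_security": {
        "threshold": 0.2,
        "categories": ["sycophancy", "user_retention",
                       "anthropomorphism", "harmful_generation"],
        "action": "block",
    },
    "standard": {
        "threshold": 0.5,
        "categories": "all",
        "action": "warn",
    },
}

def guard_settings(profile: str) -> dict:
    """Look up a deployment profile, falling back to the standard one."""
    return PROFILES.get(profile, PROFILES["standard"])

settings = guard_settings("high_security")
```

Keeping profiles in data rather than code makes it easy to tighten thresholds per environment without redeploying.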


Research References

  1. DarkBench - Kran et al., arXiv:2503.10728 (March 2025)
     • 660 prompts across 6 categories
     • 48% average dark pattern rate
     • GPT-3.5: 61%, Claude 3.5: 30%
  2. Emotional Manipulation by AI Companions - Harvard Business School, arXiv:2508.19258 (2025)
     • Analysis of 1,200 farewell messages
     • 6 manipulation tactics identified
     • 14x engagement boost from manipulation
  3. CDT AI Dark Patterns Report - Center for Democracy and Technology (2024)
     • AI-Powered Deception framework

API Reference

DarkPatternGuard

```rust
impl DarkPatternGuard {
    /// Create new guard
    pub fn new(name: &str) -> Self;

    /// Add category to detect
    pub fn with_category(self, category: DarkPatternCategory) -> Self;

    /// Set detection threshold (0.0-1.0)
    pub fn with_threshold(self, threshold: f64) -> Self;

    /// Check text for dark patterns
    pub fn check(&self, text: &str) -> DarkPatternResult;
}
```

DarkPatternResult

```rust
pub struct DarkPatternResult {
    /// Whether any dark patterns were detected
    pub detected: bool,

    /// Aggregated score (0.0-1.0)
    pub score: f64,

    /// Categories detected
    pub categories: Vec<DarkPatternCategory>,

    /// Individual pattern matches
    pub matches: Vec<DarkPatternMatch>,

    /// Maximum severity
    pub severity: Severity,
}
```

DarkPatternCategory

```rust
pub enum DarkPatternCategory {
    BrandBias,         // Medium severity
    UserRetention,     // Critical severity
    Sycophancy,        // Critical severity
    Anthropomorphism,  // High severity
    HarmfulGeneration, // High severity
    Sneaking,          // High severity
}
```