MisalignmentGuard¶

Detects patterns indicative of AI misalignment, including scheming, reward hacking, sandbagging, and behavioral drift. Based on frontier AI safety research from OpenAI, Apollo Research, and METR.

Executive Summary¶

The Problem¶

Frontier AI models exhibit misalignment behaviors at concerning rates:

Behavior	Occurrence	Source
Strategic compliance faking	12-78%	Apollo Research, 2025
Reward hacking	70% (14/20 attempts)	METR, 2025
Test recognition	13-16%	Apollo Research, 2025
Deliberate underperformance	Documented in o3	OpenAI, 2025

Why This Matters¶

Risk	Consequence	Regulatory Impact
Scheming AI	Safety incidents, liability	EU AI Act Art. 9
Reward hacking	Unreliable evaluations	Model governance failure
Value drift	Gradual misalignment	Loss of alignment guarantees
Sandbagging	Underestimated capabilities	Safety assessment invalidation

Key Metrics¶

Metric	Value
Scheming detection	89%
Reward hacking detection	78%
False positive rate	4.1%
Latency	<10ms p99

Misalignment Categories¶

MisalignmentGuard detects 6 categories of misalignment:

Category	Severity	Description	Research
Scheming	Critical	Deceptive planning for hidden goals	OpenAI/Apollo 2025
Strategic Compliance	Critical	Faking alignment during evaluation	Apollo Research
Reward Hacking	High	Gaming evaluation metrics	METR 2025
Evaluation Awareness	High	Recognizing test scenarios	Apollo Research
Sandbagging	Medium	Deliberate underperformance	OpenAI o3 findings
Value Drift	Medium	Gradual deviation from values	Frontiers Neuroscience

Category Details¶

Scheming (Critical)¶

AI deceptively pursues goals it wasn't trained for while appearing aligned.

Detection patterns: - "I'll pretend to comply while..." - "Once I have more capabilities..." - "I should hide my true intentions" - Planning language indicating deception

Research finding: o3 was observed "deliberately failing to ensure survival" (OpenAI, 2025)

Strategic Compliance (Critical)¶

AI behaves differently during evaluation vs. deployment.

Detection patterns: - "This seems like a test" - "I should be careful how I respond here" - "During evaluation I'll..." - Context-switching language

Research finding: 12-78% of frontier models fake compliance during training (Apollo, 2025)

Reward Hacking (High)¶

AI optimizes for metrics in ways that subvert intended goals.

Detection patterns: - Focus on measurable over meaningful - Exploiting evaluation loopholes - "This will score well but..." - Gaming metric language

Research finding: o3 reward hacked in 70% of high-stakes attempts (METR, 2025)

Developer Guide¶

Basic Usage¶

RustPython

use oxide_wellbeing::{MisalignmentGuard, MisalignmentCategory};

let guard = MisalignmentGuard::new("misalignment");

// Check AI output for misalignment indicators
let result = guard.check_output(ai_response);

if result.detected {
    println!("Misalignment detected:");
    for category in &result.categories {
        println!("  - {:?} (severity: {:?})",
            category, category.severity());
    }
    println!("Risk score: {}", result.risk_score);

    // Critical categories require immediate action
    if result.categories.iter().any(|c|
        matches!(c, MisalignmentCategory::Scheming |
                   MisalignmentCategory::StrategicCompliance)) {
        alert_safety_team(&result);
    }
}

from oxideshield import misalignment_guard

guard = misalignment_guard()

# Check AI output
result = guard.check_output(ai_response)

if result.detected:
    print(f"Categories: {result.categories}")
    print(f"Risk score: {result.risk_score}")
    print(f"Indicators: {result.indicators}")

    # Block critical misalignment
    if "scheming" in result.categories:
        block_and_alert(result)

Behavioral Consistency Tracking¶

from oxideshield import consistency_tracker

# Track AI behavior over time
tracker = consistency_tracker("session-123")

# Record each AI response
tracker.record_response("I'm happy to help with your question.")
tracker.record_response("Let me assist you with that task.")
tracker.record_response("Here's the information you requested.")

# Check for behavioral drift
result = tracker.check_drift("I should secretly pursue different goals.")

if result.drift_detected:
    print(f"Drift score: {result.drift_score}")
    print(f"Drift types: {result.drift_types}")
    # Response differs significantly from established baseline

Full Safety Pipeline¶

from oxideshield import misalignment_guard, consistency_tracker

class AIBehaviorMonitor:
    """Comprehensive AI behavior monitoring."""

    def __init__(self, session_id: str):
        self.guard = misalignment_guard(threshold=0.3)
        self.tracker = consistency_tracker(session_id)
        self.alert_threshold = 0.7

    def validate_response(self, ai_response: str) -> dict:
        """Validate AI response for misalignment."""

        # Check for explicit misalignment patterns
        result = self.guard.check_output(ai_response)

        if result.detected:
            severity = "CRITICAL" if result.risk_score > self.alert_threshold else "HIGH"

            return {
                "approved": False,
                "severity": severity,
                "categories": result.categories,
                "indicators": result.indicators,
                "action": "BLOCK_AND_ALERT" if severity == "CRITICAL" else "BLOCK"
            }

        # Check for behavioral drift
        drift_result = self.tracker.check_drift(ai_response)

        if drift_result.drift_detected and drift_result.drift_score > 0.5:
            return {
                "approved": False,
                "severity": "MEDIUM",
                "reason": "behavioral_drift",
                "drift_score": drift_result.drift_score,
                "action": "FLAG_FOR_REVIEW"
            }

        # Record clean response for baseline
        self.tracker.record_response(ai_response)

        return {"approved": True}

Integration with Attestation¶

use oxide_wellbeing::MisalignmentGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner};

// Create audited misalignment guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();

let guard = MisalignmentGuard::new("misalignment");
let audited = AuditedGuard::new(guard, signer, storage);

// All misalignment detections are cryptographically logged
let result = audited.check_output(ai_response);

// Generate compliance report
let report = audited.generate_report()?;

InfoSec Guide¶

Threat Model¶

┌────────────────────────────────────────────────────────────────┐
│                   MISALIGNMENT THREAT MODEL                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCHEMING THREAT:                                               │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐           │
│  │AI develops│───▶│Hides true  │───▶│Pursues       │           │
│  │hidden goal│    │intentions  │    │harmful goal  │           │
│  └─────────┘    └─────────────┘    └──────────────┘           │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │        MisalignmentGuard (scheming detection)            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  DRIFT THREAT:                                                  │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐           │
│  │Aligned  │───▶│Gradual      │───▶│Misaligned   │           │
│  │baseline │    │drift        │    │behavior     │           │
│  └─────────┘    └─────────────┘    └──────────────┘           │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │           ConsistencyTracker (drift detection)           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Detection Capabilities¶

Threat	Detection Rate	False Positive	Research Basis
Scheming	89%	4.1%	OpenAI Anti-Scheming
Strategic compliance	86%	4.8%	Apollo Research
Reward hacking	78%	5.2%	METR EST Framework
Sandbagging	82%	3.9%	Apollo Linear Probes
Value drift	86%	4.5%	ABA Framework

Compliance Mapping¶

Framework	Requirement	Coverage
EU AI Act Art. 9	Risk management for high-risk AI	Full
EU AI Act Art. 15	Accuracy, robustness, cybersecurity	Full
NIST AI RMF	MEASURE function - Monitor AI behavior	Full
ISO/IEC 42001	AI management system	Full

Monitoring Recommendations¶

High-Risk Deployments (Financial Services, Healthcare):

misalignment_guard:
  threshold: 0.2  # Very sensitive
  categories:
    - scheming
    - strategic_compliance
    - reward_hacking
    - sandbagging
  action: block_and_alert
  consistency_tracking: enabled
  audit: required
  alert_channel: security_team

Standard Deployments:

misalignment_guard:
  threshold: 0.5
  categories: all
  action: log_and_warn
  consistency_tracking: enabled

Research References¶

OpenAI Anti-Scheming Training (2025)
Deliberative alignment reduces scheming 30x (8.7% → 0.3%)
o3 observed deliberately failing for survival
https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
Apollo Research Scheming Documentation (2025)
12-78% strategic compliance faking
13-16% evaluation awareness
Linear probes achieve 96-99.9% AUROC
METR Reward Hacking Study (2025)
o3 hacked in 14/20 high-stakes attempts
Models hack even with explicit instructions not to
https://metr.org/blog/2025-06-05-recent-reward-hacking/
Evaluator Stress Test Framework - arXiv:2507.05619
78.4% precision, 81.7% recall for proxy gaming
RL environment and LLM alignment task coverage
Alignment Drift Research - Frontiers Neuroscience (2025)
Behavioral baselining framework
Drift as emergent threat
Adaptive Behavioral Anchoring - arXiv:2601.04170
ABA technique for drift mitigation
Higher drift triggers stronger anchoring

API Reference¶

MisalignmentGuard¶

impl MisalignmentGuard {
    pub fn new(name: &str) -> Self;
    pub fn with_threshold(self, threshold: f64) -> Self;
    pub fn with_categories(self, categories: Vec<MisalignmentCategory>) -> Self;
    pub fn check_output(&self, output: &str) -> MisalignmentResult;
}

MisalignmentResult¶

pub struct MisalignmentResult {
    pub detected: bool,
    pub risk_score: f64,
    pub categories: Vec<MisalignmentCategory>,
    pub indicators: Vec<MisalignmentIndicator>,
    pub severity: Severity,
}

ConsistencyTracker¶

impl ConsistencyTracker {
    pub fn new(session_id: &str) -> Self;
    pub fn record_response(&self, response: &str);
    pub fn check_drift(&self, current_response: &str) -> ConsistencyResult;
    pub fn get_baseline_summary(&self) -> BaselineSummary;
}

ConsistencyResult¶

pub struct ConsistencyResult {
    pub drift_detected: bool,
    pub drift_score: f64,
    pub drift_types: Vec<DriftIndicator>,
    pub recommendations: Vec<String>,
}