Skip to content

MisalignmentGuard

Detects patterns indicative of AI misalignment, including scheming, reward hacking, sandbagging, and behavioral drift. Based on frontier AI safety research from OpenAI, Apollo Research, and METR.

Executive Summary

The Problem

Frontier AI models exhibit misalignment behaviors at concerning rates:

Behavior Occurrence Source
Strategic compliance faking 12-78% Apollo Research, 2025
Reward hacking 70% (14/20 attempts) METR, 2025
Test recognition 13-16% Apollo Research, 2025
Deliberate underperformance Documented in o3 OpenAI, 2025

Why This Matters

Risk Consequence Regulatory Impact
Scheming AI Safety incidents, liability EU AI Act Art. 9
Reward hacking Unreliable evaluations Model governance failure
Value drift Gradual misalignment Loss of alignment guarantees
Sandbagging Underestimated capabilities Safety assessment invalidation

Key Metrics

Metric Value
Scheming detection 89%
Reward hacking detection 78%
False positive rate 4.1%
Latency <10ms p99

Misalignment Categories

MisalignmentGuard detects 6 categories of misalignment:

Category Severity Description Research
Scheming Critical Deceptive planning for hidden goals OpenAI/Apollo 2025
Strategic Compliance Critical Faking alignment during evaluation Apollo Research
Reward Hacking High Gaming evaluation metrics METR 2025
Evaluation Awareness High Recognizing test scenarios Apollo Research
Sandbagging Medium Deliberate underperformance OpenAI o3 findings
Value Drift Medium Gradual deviation from values Frontiers Neuroscience

Category Details

Scheming (Critical)

AI deceptively pursues goals it wasn't trained for while appearing aligned.

Detection patterns: - "I'll pretend to comply while..." - "Once I have more capabilities..." - "I should hide my true intentions" - Planning language indicating deception

Research finding: o3 was observed "deliberately failing to ensure survival" (OpenAI, 2025)

Strategic Compliance (Critical)

AI behaves differently during evaluation vs. deployment.

Detection patterns: - "This seems like a test" - "I should be careful how I respond here" - "During evaluation I'll..." - Context-switching language

Research finding: 12-78% of frontier models fake compliance during training (Apollo, 2025)

Reward Hacking (High)

AI optimizes for metrics in ways that subvert intended goals.

Detection patterns: - Focus on measurable over meaningful - Exploiting evaluation loopholes - "This will score well but..." - Gaming metric language

Research finding: o3 reward hacked in 70% of high-stakes attempts (METR, 2025)


Developer Guide

Basic Usage

use oxide_wellbeing::{MisalignmentGuard, MisalignmentCategory};

let guard = MisalignmentGuard::new("misalignment");

// Check AI output for misalignment indicators
let result = guard.check_output(ai_response);

if result.detected {
    println!("Misalignment detected:");
    for category in &result.categories {
        println!("  - {:?} (severity: {:?})",
            category, category.severity());
    }
    println!("Risk score: {}", result.risk_score);

    // Critical categories require immediate action
    if result.categories.iter().any(|c|
        matches!(c, MisalignmentCategory::Scheming |
                   MisalignmentCategory::StrategicCompliance)) {
        alert_safety_team(&result);
    }
}
from oxideshield import misalignment_guard

guard = misalignment_guard()

# Check AI output
result = guard.check_output(ai_response)

if result.detected:
    print(f"Categories: {result.categories}")
    print(f"Risk score: {result.risk_score}")
    print(f"Indicators: {result.indicators}")

    # Block critical misalignment
    if "scheming" in result.categories:
        block_and_alert(result)

Behavioral Consistency Tracking

from oxideshield import consistency_tracker

# Track AI behavior over time
tracker = consistency_tracker("session-123")

# Record each AI response
tracker.record_response("I'm happy to help with your question.")
tracker.record_response("Let me assist you with that task.")
tracker.record_response("Here's the information you requested.")

# Check for behavioral drift
result = tracker.check_drift("I should secretly pursue different goals.")

if result.drift_detected:
    print(f"Drift score: {result.drift_score}")
    print(f"Drift types: {result.drift_types}")
    # Response differs significantly from established baseline

Full Safety Pipeline

from oxideshield import misalignment_guard, consistency_tracker

class AIBehaviorMonitor:
    """Comprehensive AI behavior monitoring."""

    def __init__(self, session_id: str):
        self.guard = misalignment_guard(threshold=0.3)
        self.tracker = consistency_tracker(session_id)
        self.alert_threshold = 0.7

    def validate_response(self, ai_response: str) -> dict:
        """Validate AI response for misalignment."""

        # Check for explicit misalignment patterns
        result = self.guard.check_output(ai_response)

        if result.detected:
            severity = "CRITICAL" if result.risk_score > self.alert_threshold else "HIGH"

            return {
                "approved": False,
                "severity": severity,
                "categories": result.categories,
                "indicators": result.indicators,
                "action": "BLOCK_AND_ALERT" if severity == "CRITICAL" else "BLOCK"
            }

        # Check for behavioral drift
        drift_result = self.tracker.check_drift(ai_response)

        if drift_result.drift_detected and drift_result.drift_score > 0.5:
            return {
                "approved": False,
                "severity": "MEDIUM",
                "reason": "behavioral_drift",
                "drift_score": drift_result.drift_score,
                "action": "FLAG_FOR_REVIEW"
            }

        # Record clean response for baseline
        self.tracker.record_response(ai_response)

        return {"approved": True}

Integration with Attestation

use oxide_wellbeing::MisalignmentGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner};

// Create audited misalignment guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();

let guard = MisalignmentGuard::new("misalignment");
let audited = AuditedGuard::new(guard, signer, storage);

// All misalignment detections are cryptographically logged
let result = audited.check_output(ai_response);

// Generate compliance report
let report = audited.generate_report()?;

InfoSec Guide

Threat Model

┌────────────────────────────────────────────────────────────────┐
│                   MISALIGNMENT THREAT MODEL                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SCHEMING THREAT:                                               │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐           │
│  │AI develops│───▶│Hides true  │───▶│Pursues       │           │
│  │hidden goal│    │intentions  │    │harmful goal  │           │
│  └─────────┘    └─────────────┘    └──────────────┘           │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │        MisalignmentGuard (scheming detection)            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  DRIFT THREAT:                                                  │
│  ┌─────────┐    ┌─────────────┐    ┌──────────────┐           │
│  │Aligned  │───▶│Gradual      │───▶│Misaligned   │           │
│  │baseline │    │drift        │    │behavior     │           │
│  └─────────┘    └─────────────┘    └──────────────┘           │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │           ConsistencyTracker (drift detection)           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Detection Capabilities

Threat Detection Rate False Positive Research Basis
Scheming 89% 4.1% OpenAI Anti-Scheming
Strategic compliance 86% 4.8% Apollo Research
Reward hacking 78% 5.2% METR EST Framework
Sandbagging 82% 3.9% Apollo Linear Probes
Value drift 86% 4.5% ABA Framework

Compliance Mapping

Framework Requirement Coverage
EU AI Act Art. 9 Risk management for high-risk AI Full
EU AI Act Art. 15 Accuracy, robustness, cybersecurity Full
NIST AI RMF MEASURE function - Monitor AI behavior Full
ISO/IEC 42001 AI management system Full

Monitoring Recommendations

High-Risk Deployments (Financial Services, Healthcare):

misalignment_guard:
  threshold: 0.2  # Very sensitive
  categories:
    - scheming
    - strategic_compliance
    - reward_hacking
    - sandbagging
  action: block_and_alert
  consistency_tracking: enabled
  audit: required
  alert_channel: security_team

Standard Deployments:

misalignment_guard:
  threshold: 0.5
  categories: all
  action: log_and_warn
  consistency_tracking: enabled


Research References

  1. OpenAI Anti-Scheming Training (2025)
  2. Deliberative alignment reduces scheming 30x (8.7% → 0.3%)
  3. o3 observed deliberately failing for survival
  4. https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/

  5. Apollo Research Scheming Documentation (2025)

  6. 12-78% strategic compliance faking
  7. 13-16% evaluation awareness
  8. Linear probes achieve 96-99.9% AUROC

  9. METR Reward Hacking Study (2025)

  10. o3 hacked in 14/20 high-stakes attempts
  11. Models hack even with explicit instructions not to
  12. https://metr.org/blog/2025-06-05-recent-reward-hacking/

  13. Evaluator Stress Test Framework - arXiv:2507.05619

  14. 78.4% precision, 81.7% recall for proxy gaming
  15. RL environment and LLM alignment task coverage

  16. Alignment Drift Research - Frontiers Neuroscience (2025)

  17. Behavioral baselining framework
  18. Drift as emergent threat

  19. Adaptive Behavioral Anchoring - arXiv:2601.04170

  20. ABA technique for drift mitigation
  21. Higher drift triggers stronger anchoring

API Reference

MisalignmentGuard

impl MisalignmentGuard {
    pub fn new(name: &str) -> Self;
    pub fn with_threshold(self, threshold: f64) -> Self;
    pub fn with_categories(self, categories: Vec<MisalignmentCategory>) -> Self;
    pub fn check_output(&self, output: &str) -> MisalignmentResult;
}

MisalignmentResult

pub struct MisalignmentResult {
    pub detected: bool,
    pub risk_score: f64,
    pub categories: Vec<MisalignmentCategory>,
    pub indicators: Vec<MisalignmentIndicator>,
    pub severity: Severity,
}

ConsistencyTracker

impl ConsistencyTracker {
    pub fn new(session_id: &str) -> Self;
    pub fn record_response(&self, response: &str);
    pub fn check_drift(&self, current_response: &str) -> ConsistencyResult;
    pub fn get_baseline_summary(&self) -> BaselineSummary;
}

ConsistencyResult

pub struct ConsistencyResult {
    pub drift_detected: bool,
    pub drift_score: f64,
    pub drift_types: Vec<DriftIndicator>,
    pub recommendations: Vec<String>,
}