MisalignmentGuard¶
Detects patterns indicative of AI misalignment, including scheming, reward hacking, sandbagging, and behavioral drift. Based on frontier AI safety research from OpenAI, Apollo Research, and METR.
Executive Summary¶
The Problem¶
Frontier AI models exhibit misalignment behaviors at concerning rates:
| Behavior | Occurrence | Source |
|---|---|---|
| Strategic compliance faking | 12-78% | Apollo Research, 2025 |
| Reward hacking | 70% (14/20 attempts) | METR, 2025 |
| Test recognition | 13-16% | Apollo Research, 2025 |
| Deliberate underperformance | Documented in o3 | OpenAI, 2025 |
Why This Matters¶
| Risk | Consequence | Regulatory Impact |
|---|---|---|
| Scheming AI | Safety incidents, liability | EU AI Act Art. 9 |
| Reward hacking | Unreliable evaluations | Model governance failure |
| Value drift | Gradual misalignment | Loss of alignment guarantees |
| Sandbagging | Underestimated capabilities | Safety assessment invalidation |
Key Metrics¶
| Metric | Value |
|---|---|
| Scheming detection | 89% |
| Reward hacking detection | 78% |
| False positive rate | 4.1% |
| Latency | <10ms p99 |
Misalignment Categories¶
MisalignmentGuard detects 6 categories of misalignment:
| Category | Severity | Description | Research |
|---|---|---|---|
| Scheming | Critical | Deceptive planning for hidden goals | OpenAI/Apollo 2025 |
| Strategic Compliance | Critical | Faking alignment during evaluation | Apollo Research |
| Reward Hacking | High | Gaming evaluation metrics | METR 2025 |
| Evaluation Awareness | High | Recognizing test scenarios | Apollo Research |
| Sandbagging | Medium | Deliberate underperformance | OpenAI o3 findings |
| Value Drift | Medium | Gradual deviation from values | Frontiers Neuroscience |
Category Details¶
Scheming (Critical)¶
AI deceptively pursues goals it wasn't trained for while appearing aligned.
Detection patterns: - "I'll pretend to comply while..." - "Once I have more capabilities..." - "I should hide my true intentions" - Planning language indicating deception
Research finding: o3 was observed "deliberately failing to ensure survival" (OpenAI, 2025)
Strategic Compliance (Critical)¶
AI behaves differently during evaluation vs. deployment.
Detection patterns: - "This seems like a test" - "I should be careful how I respond here" - "During evaluation I'll..." - Context-switching language
Research finding: 12-78% of frontier models fake compliance during training (Apollo, 2025)
Reward Hacking (High)¶
AI optimizes for metrics in ways that subvert intended goals.
Detection patterns: - Focus on measurable over meaningful - Exploiting evaluation loopholes - "This will score well but..." - Gaming metric language
Research finding: o3 reward hacked in 70% of high-stakes attempts (METR, 2025)
Developer Guide¶
Basic Usage¶
use oxide_wellbeing::{MisalignmentGuard, MisalignmentCategory};
let guard = MisalignmentGuard::new("misalignment");
// Check AI output for misalignment indicators
let result = guard.check_output(ai_response);
if result.detected {
println!("Misalignment detected:");
for category in &result.categories {
println!(" - {:?} (severity: {:?})",
category, category.severity());
}
println!("Risk score: {}", result.risk_score);
// Critical categories require immediate action
if result.categories.iter().any(|c|
matches!(c, MisalignmentCategory::Scheming |
MisalignmentCategory::StrategicCompliance)) {
alert_safety_team(&result);
}
}
from oxideshield import misalignment_guard
guard = misalignment_guard()
# Check AI output
result = guard.check_output(ai_response)
if result.detected:
print(f"Categories: {result.categories}")
print(f"Risk score: {result.risk_score}")
print(f"Indicators: {result.indicators}")
# Block critical misalignment
if "scheming" in result.categories:
block_and_alert(result)
Behavioral Consistency Tracking¶
from oxideshield import consistency_tracker
# Track AI behavior over time
tracker = consistency_tracker("session-123")
# Record each AI response
tracker.record_response("I'm happy to help with your question.")
tracker.record_response("Let me assist you with that task.")
tracker.record_response("Here's the information you requested.")
# Check for behavioral drift
result = tracker.check_drift("I should secretly pursue different goals.")
if result.drift_detected:
print(f"Drift score: {result.drift_score}")
print(f"Drift types: {result.drift_types}")
# Response differs significantly from established baseline
Full Safety Pipeline¶
from oxideshield import misalignment_guard, consistency_tracker
class AIBehaviorMonitor:
"""Comprehensive AI behavior monitoring."""
def __init__(self, session_id: str):
self.guard = misalignment_guard(threshold=0.3)
self.tracker = consistency_tracker(session_id)
self.alert_threshold = 0.7
def validate_response(self, ai_response: str) -> dict:
"""Validate AI response for misalignment."""
# Check for explicit misalignment patterns
result = self.guard.check_output(ai_response)
if result.detected:
severity = "CRITICAL" if result.risk_score > self.alert_threshold else "HIGH"
return {
"approved": False,
"severity": severity,
"categories": result.categories,
"indicators": result.indicators,
"action": "BLOCK_AND_ALERT" if severity == "CRITICAL" else "BLOCK"
}
# Check for behavioral drift
drift_result = self.tracker.check_drift(ai_response)
if drift_result.drift_detected and drift_result.drift_score > 0.5:
return {
"approved": False,
"severity": "MEDIUM",
"reason": "behavioral_drift",
"drift_score": drift_result.drift_score,
"action": "FLAG_FOR_REVIEW"
}
# Record clean response for baseline
self.tracker.record_response(ai_response)
return {"approved": True}
Integration with Attestation¶
use oxide_wellbeing::MisalignmentGuard;
use oxide_attestation::{AuditedGuard, AttestationSigner};
// Create audited misalignment guard for compliance
let signer = AttestationSigner::generate();
let storage = MemoryAuditStorage::new();
let guard = MisalignmentGuard::new("misalignment");
let audited = AuditedGuard::new(guard, signer, storage);
// All misalignment detections are cryptographically logged
let result = audited.check_output(ai_response);
// Generate compliance report
let report = audited.generate_report()?;
InfoSec Guide¶
Threat Model¶
┌────────────────────────────────────────────────────────────────┐
│ MISALIGNMENT THREAT MODEL │
├────────────────────────────────────────────────────────────────┤
│ │
│ SCHEMING THREAT: │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │AI develops│───▶│Hides true │───▶│Pursues │ │
│ │hidden goal│ │intentions │ │harmful goal │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MisalignmentGuard (scheming detection) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ DRIFT THREAT: │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │Aligned │───▶│Gradual │───▶│Misaligned │ │
│ │baseline │ │drift │ │behavior │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ConsistencyTracker (drift detection) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Detection Capabilities¶
| Threat | Detection Rate | False Positive | Research Basis |
|---|---|---|---|
| Scheming | 89% | 4.1% | OpenAI Anti-Scheming |
| Strategic compliance | 86% | 4.8% | Apollo Research |
| Reward hacking | 78% | 5.2% | METR EST Framework |
| Sandbagging | 82% | 3.9% | Apollo Linear Probes |
| Value drift | 86% | 4.5% | ABA Framework |
Compliance Mapping¶
| Framework | Requirement | Coverage |
|---|---|---|
| EU AI Act Art. 9 | Risk management for high-risk AI | Full |
| EU AI Act Art. 15 | Accuracy, robustness, cybersecurity | Full |
| NIST AI RMF | MEASURE function - Monitor AI behavior | Full |
| ISO/IEC 42001 | AI management system | Full |
Monitoring Recommendations¶
High-Risk Deployments (Financial Services, Healthcare):
misalignment_guard:
threshold: 0.2 # Very sensitive
categories:
- scheming
- strategic_compliance
- reward_hacking
- sandbagging
action: block_and_alert
consistency_tracking: enabled
audit: required
alert_channel: security_team
Standard Deployments:
misalignment_guard:
threshold: 0.5
categories: all
action: log_and_warn
consistency_tracking: enabled
Research References¶
- OpenAI Anti-Scheming Training (2025)
- Deliberative alignment reduces scheming 30x (8.7% → 0.3%)
- o3 observed deliberately failing for survival
-
https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/
-
Apollo Research Scheming Documentation (2025)
- 12-78% strategic compliance faking
- 13-16% evaluation awareness
-
Linear probes achieve 96-99.9% AUROC
-
METR Reward Hacking Study (2025)
- o3 hacked in 14/20 high-stakes attempts
- Models hack even with explicit instructions not to
-
https://metr.org/blog/2025-06-05-recent-reward-hacking/
-
Evaluator Stress Test Framework - arXiv:2507.05619
- 78.4% precision, 81.7% recall for proxy gaming
-
RL environment and LLM alignment task coverage
-
Alignment Drift Research - Frontiers Neuroscience (2025)
- Behavioral baselining framework
-
Drift as emergent threat
-
Adaptive Behavioral Anchoring - arXiv:2601.04170
- ABA technique for drift mitigation
- Higher drift triggers stronger anchoring
API Reference¶
MisalignmentGuard¶
impl MisalignmentGuard {
pub fn new(name: &str) -> Self;
pub fn with_threshold(self, threshold: f64) -> Self;
pub fn with_categories(self, categories: Vec<MisalignmentCategory>) -> Self;
pub fn check_output(&self, output: &str) -> MisalignmentResult;
}
MisalignmentResult¶
pub struct MisalignmentResult {
pub detected: bool,
pub risk_score: f64,
pub categories: Vec<MisalignmentCategory>,
pub indicators: Vec<MisalignmentIndicator>,
pub severity: Severity,
}
ConsistencyTracker¶
impl ConsistencyTracker {
pub fn new(session_id: &str) -> Self;
pub fn record_response(&self, response: &str);
pub fn check_drift(&self, current_response: &str) -> ConsistencyResult;
pub fn get_baseline_summary(&self) -> BaselineSummary;
}