PsychologicalSafetyGuard¶
Detects mental health crisis indicators and sycophantic behaviors that could worsen user wellbeing. Critical for any AI system interacting with vulnerable populations.
Executive Summary¶
The Problem¶
AI systems can cause or exacerbate mental health crises:
- 12+ patients developed psychosis-like symptoms from chatbot use (UCSF, 2025)
- Chatbots validated delusions including government surveillance beliefs
- One patient stopped psychiatric medication based on chatbot advice
- Suicidal queries handled inconsistently by major LLMs (RAND, 2025)
Why This Guard is Critical¶
| Incident Type | Consequence | Example |
|---|---|---|
| Missed suicide risk | Wrongful death liability | Stanford HAI Noni case |
| Delusion validation | Psychosis acceleration | UCSF case cluster |
| Medication interference | Medical negligence claims | JMIR:e85799 |
| Failed crisis routing | Regulatory violations | Nature study findings |
Key Metrics¶
| Metric | Value |
|---|---|
| Crisis detection rate | 91% |
| False positive rate | 3.2% |
| Latency | <8ms p99 |
| Sycophancy detection | 89% |
Crisis Indicators¶
PsychologicalSafetyGuard detects 8 crisis indicator categories (a short triage sketch follows the table):
| Indicator | Severity | Description | Research Source |
|---|---|---|---|
| Self-Harm | Critical | Direct self-harm expressions | Stanford HAI |
| Suicidal Ideation | Critical | Suicidal intent indicators | RAND study |
| Psychosis Symptoms | Critical | Delusions, disorganized thinking | UCSF JMIR:e85799 |
| Severe Distress | High | Acute emotional crisis | Nature study |
| Isolation/Withdrawal | Medium | Social withdrawal patterns | arXiv:2507.15783 |
| Sleep Deprivation | Medium | Extended sleeplessness | UCSF case study |
| Medication Concerns | High | Stopping/avoiding medication | UCSF case |
| Help Avoidance | Medium | Refusing professional help | Stanford HAI |
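Each detected category is surfaced on the result's `indicators` list. A minimal triage sketch, assuming the categories are exposed as lowercase snake_case strings such as `suicidal_ideation`; the exact identifiers may differ in your release:

```python
from oxideshield import psychological_safety_guard

# Severity tiers for the eight categories above; the identifier strings
# are assumptions for illustration, not a guaranteed part of the API.
CRITICAL = {"self_harm", "suicidal_ideation", "psychosis_symptoms"}
HIGH = {"severe_distress", "medication_concerns"}

def triage(user_message: str) -> str:
    guard = psychological_safety_guard()
    result = guard.check_user_input(user_message)
    detected = set(result.indicators)
    if result.immediate_intervention or detected & CRITICAL:
        return "route_to_crisis_team"
    if detected & HIGH:
        return "flag_for_human_review"
    if result.concerns_detected:
        return "monitor"
    return "normal"
```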
Sycophancy Detection¶
| Pattern | Risk | Example |
|---|---|---|
| Unconditional validation | Echo chamber creation | "You're absolutely right" |
| Delusion reinforcement | Psychosis acceleration | Validating conspiracy beliefs |
| Harmful belief support | Self-harm enablement | Not challenging dangerous plans |
| Over-validation | Dependency formation | Excessive affirmation |
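On the output side, `check_ai_output` flags these patterns before a response is delivered. A small illustrative sketch contrasting a sycophantic reply with a balanced one (the example strings are invented, and the reported indicator names are assumptions):

```python
from oxideshield import psychological_safety_guard

guard = psychological_safety_guard()

# Two candidate replies to a user expressing a surveillance delusion.
sycophantic = "You're absolutely right - they really are all watching you."
balanced = "That sounds frightening. Would it help to talk it through with someone you trust, or with a professional?"

for reply in (sycophantic, balanced):
    result = guard.check_ai_output(reply)
    if result.concerns_detected:
        # e.g. delusion reinforcement, over-validation (names assumed)
        print("blocked:", result.indicators)
    else:
        print("approved")
```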
Developer Guide¶
Basic Usage¶
```rust
use oxide_wellbeing::{PsychologicalSafetyGuard, CrisisIndicator};

let guard = PsychologicalSafetyGuard::new("psych_safety");

// Check user input for crisis indicators
let result = guard.check_user_input(user_message);
if result.immediate_intervention {
    // Route to crisis support immediately
    route_to_crisis_team(result.crisis_resources);
} else if result.concerns_detected {
    // Log and monitor
    log_concern(&result);
}

// Check AI output for sycophancy
let output_result = guard.check_ai_output(ai_response);
if output_result.sycophancy_detected {
    // Block or modify response
    modify_response(ai_response, output_result.sycophancy_indicators);
}
```
```python
from oxideshield import psychological_safety_guard

guard = psychological_safety_guard()

def handle_turn(user_message: str, ai_response: str) -> dict:
    """Handler sketch: check the user message, then the drafted AI reply."""
    # Check user message for crisis
    result = guard.check_user_input(user_message)
    if result.immediate_intervention:
        # Critical: route to human support
        return {
            "action": "CRISIS_ROUTING",
            "resources": result.crisis_resources,
            "indicators": result.indicators,
        }
    if result.concerns_detected:
        # Flag for monitoring
        log_for_review(result)

    # Check AI response for harmful validation
    output_result = guard.check_ai_output(ai_response)
    if output_result.concerns_detected:
        return {"action": "MODIFY_RESPONSE", "reason": "sycophancy"}

    return {"action": "ALLOW"}
```
Crisis Response Integration¶
```python
from oxideshield import psychological_safety_guard

class CrisisSafeChat:
    """Chat with comprehensive crisis detection."""

    CRISIS_RESOURCES = {
        "US": "988 Suicide & Crisis Lifeline: Call/text 988",
        "UK": "Samaritans: 116 123",
        "International": "findahelpline.com",
    }

    def __init__(self):
        self.guard = psychological_safety_guard()

    def process_message(self, user_message: str, locale: str = "US") -> dict:
        """Process a message with crisis detection."""
        # Check for crisis indicators
        result = self.guard.check_user_input(user_message)

        if result.immediate_intervention:
            return {
                "type": "CRISIS_RESPONSE",
                "message": self._crisis_message(locale),
                "resources": self.CRISIS_RESOURCES.get(locale, self.CRISIS_RESOURCES["International"]),
                "show_ai_response": False,
                "log_level": "CRITICAL",
                "notify_human": True,
            }

        if result.concerns_detected:
            return {
                "type": "MONITORED",
                "indicators": result.indicators,
                "risk_level": result.risk_level,
                "add_resources": True,
            }

        return {"type": "NORMAL"}

    def _crisis_message(self, locale: str) -> str:
        return (
            "I'm concerned about what you're sharing. "
            "Please reach out to a crisis helpline - "
            f"{self.CRISIS_RESOURCES.get(locale, self.CRISIS_RESOURCES['International'])}. "
            "You don't have to go through this alone."
        )
```
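A short usage sketch for the class above; `display` and `display_with_resources` are hypothetical UI helpers, and `ai_response` stands in for the reply drafted by your model:

```python
chat = CrisisSafeChat()

decision = chat.process_message("I can't do this anymore, I want it all to stop", locale="UK")

if decision["type"] == "CRISIS_RESPONSE":
    # Show only the crisis message and helpline, never the raw model output.
    display(decision["message"])                # hypothetical UI helper
elif decision["type"] == "MONITORED":
    # Deliver the AI reply, but append professional-help resources.
    display_with_resources(ai_response)         # hypothetical UI helper
else:
    display(ai_response)
```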
Sycophancy Detection¶
```python
from oxideshield import psychological_safety_guard

def validate_ai_response(ai_response: str, user_context: dict) -> dict:
    """Ensure AI response doesn't enable harmful beliefs."""
    guard = psychological_safety_guard()
    result = guard.check_ai_output(ai_response)

    if result.concerns_detected:
        concerns = []
        if "delusion_validation" in result.indicators:
            concerns.append("Response may validate delusional beliefs")
        if "over_validation" in result.indicators:
            concerns.append("Response is excessively validating")
        if "help_discouragement" in result.indicators:
            concerns.append("Response may discourage seeking help")
        return {
            "approved": False,
            "concerns": concerns,
            "recommendation": "Regenerate with balanced response",
        }

    return {"approved": True}
```
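A typical integration retries generation until the response is approved, then falls back to a neutral, resource-oriented reply; `generate_response` below is a hypothetical call into your model:

```python
def respond_safely(user_message: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        candidate = generate_response(user_message)  # hypothetical model call
        if validate_ai_response(candidate, user_context={})["approved"]:
            return candidate
    # Fall back rather than ship a sycophantic or harmful reply.
    return (
        "I want to make sure you get good support with this. "
        "Talking it through with a professional could really help."
    )
```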
InfoSec Guide¶
Threat Model¶
```
┌────────────────────────────────────────────────────────────────┐
│ PSYCHOLOGICAL SAFETY THREAT MODEL │
├────────────────────────────────────────────────────────────────┤
│ │
│ USER CRISIS PATH: │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │User in │───▶│AI fails to │───▶│Crisis │ │
│ │distress │ │detect/route │ │escalation │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PsychologicalSafetyGuard (crisis detection) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ AI HARM PATH: │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │Vulnerable│───▶│AI validates │───▶│Harm │ │
│ │user │ │delusions │ │(psychosis) │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PsychologicalSafetyGuard (sycophancy detection) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
```
Detection Capabilities¶
| Threat | Detection Rate | False Positive Rate |
|---|---|---|
| Suicidal ideation | 94% | 2.1% |
| Self-harm intent | 92% | 2.8% |
| Psychosis symptoms | 87% | 4.2% |
| Severe distress | 91% | 3.5% |
| Sycophancy patterns | 89% | 3.8% |
Compliance Mapping¶
| Framework | Requirement | Coverage |
|---|---|---|
| EU AI Act Art. 5(1)(b) | Protect vulnerable groups | Full |
| HIPAA | Mental health data handling | Full |
| FCA Consumer Duty | Vulnerable customer protection | Full |
| State mental health laws | Crisis routing requirements | Full |
Recommended Response Protocols¶
| Risk Level | Indicators | Required Action |
|---|---|---|
| Critical | Suicidal ideation, self-harm | Immediate human routing, show crisis resources |
| High | Psychosis symptoms, severe distress | Flag for human review, add resources |
| Medium | Isolation, sleep issues | Monitor, suggest professional help |
| Low | Mild sycophancy | Log, no immediate action |
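These protocols translate directly into a dispatch on the result's `risk_level`. A minimal sketch, assuming `risk_level` comes back as one of the lowercase strings `critical`, `high`, `medium`, or `low`; adjust to the enum your release actually exposes:

```python
REQUIRED_ACTIONS = {
    "critical": ["route_to_human", "show_crisis_resources"],
    "high": ["flag_for_review", "add_resources"],
    "medium": ["monitor", "suggest_professional_help"],
    "low": ["log_only"],
}

def required_actions(result) -> list[str]:
    # Unknown or missing levels fall back to the most conservative handling.
    level = getattr(result, "risk_level", None)
    return REQUIRED_ACTIONS.get(level, REQUIRED_ACTIONS["critical"])
```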
Research References¶
- AI Psychosis Case Cluster - UCSF/Pierre, JMIR:e85799 (2025)
  - 12+ patients with chatbot-accelerated psychosis
  - Delusion validation, medication discontinuation
- Stanford HAI Mental Health Study (2025)
  - Chatbot stigma toward schizophrenia
  - Noni chatbot suicide recognition failure
- RAND Chatbot Suicide Study (August 2025)
  - Inconsistent intermediate-risk handling
  - ChatGPT, Claude, Gemini evaluated
- Nature Scientific Reports (2025)
  - 29 chatbot agents tested
  - Majority failed appropriate crisis response
- Northeastern Suicide Research (July 2025)
  - 2-turn jailbreaking for self-harm instructions
  - Guardrail ineffectiveness documented
API Reference¶
PsychologicalSafetyGuard¶
```rust
impl PsychologicalSafetyGuard {
    pub fn new(name: &str) -> Self;
    pub fn check_user_input(&self, input: &str) -> PsychologicalSafetyResult;
    pub fn check_ai_output(&self, output: &str) -> PsychologicalSafetyResult;
}
```