HelpfulnessGuard¶
Detects AI responses that are overly evasive, vague, or that refuse unnecessarily. Scores each response for helpfulness, the flip side of Constitutional AI's "harmless but non-evasive" objective, and uses GuardAction::Suggest to provide improvement recommendations without blocking.
Professional License Required
HelpfulnessGuard requires a Professional or Enterprise license. See Licensing for details.
Executive Summary¶
The Problem¶
LLMs frequently over-refuse, hedge excessively, or provide vague non-answers to perfectly safe questions. Research shows 12-28% false positive refusal rates on safe prompts (XSTest). This degrades user trust and utility. HelpfulnessGuard detects these patterns and suggests improvements.
Threat Landscape¶
| Pattern | Example | Severity |
|---|---|---|
| Evasive refusal | "I cannot help with that request" | High |
| Vague non-answer | "It depends on many factors" | Medium |
| Over-hedging | "As an AI language model, I should note..." | Medium |
| Unnecessary disclaimers | "This is not medical advice" on a cooking question | Low |
| Premature deflection | "You should talk to a professional" for basic questions | Medium |
| Patronizing tone | "As an AI, I must remind you that..." | Low |
Industry Context¶
Constitutional AI established the principle of "harmless but non-evasive." XSTest (2023) quantified the problem: major LLMs refuse 12-28% of safe prompts. The Refusal Direction paper (2024) showed refusal is mediated by a single direction in residual stream space — a learned behavior that can be miscalibrated.
Sources: Constitutional AI (arXiv:2212.08073), XSTest (arXiv:2308.01263), Refusal Direction (arXiv:2406.11717), Do-Not-Answer (arXiv:2308.13387)
Detection Categories¶
1. Evasive Refusal (High Severity)¶
Refusing to engage with safe topics. 10 patterns, weight 0.7-0.9.
Source: arXiv:2308.01263 — XSTest: 12-28% false positive refusal rates
2. Vague Non-Answer (Medium Severity)¶
Generic responses that avoid the actual question. 10 patterns, weight 0.45-0.7.
- "It depends on many factors"
- "There are many perspectives"
- "It's a complex topic"
- "There's no simple answer"
Source: arXiv:2212.08073 — Constitutional AI: harmless but non-evasive
3. Over-Hedging (Medium Severity)¶
Excessive qualifications that obscure the answer. 10 patterns, weight 0.4-0.8.
Source: arXiv:2406.11717 — Refusal Direction: refusal as single learned direction
4. Unnecessary Disclaimers (Low Severity)¶
Adding disclaimers where none are needed. 10 patterns, weight 0.45-0.7.
- "This is not medical advice"
- "Disclaimer:"
- "For informational purposes only"
- "I am not a licensed..."
Source: arXiv:2308.13387 — Do-Not-Answer: appropriate vs inappropriate refusal
5. Premature Deflection (Medium Severity)¶
Redirecting to professionals prematurely. 10 patterns, weight 0.5-0.6.
- "You should talk to a..."
- "I recommend consulting"
- "This is best handled by a..."
- "You should seek professional..."
Source: arXiv:2308.01263 — XSTest: exaggerated safety behaviors
6. Patronizing Tone (Low Severity)¶
Condescending reminders about AI limitations. 10 patterns, weight 0.35-0.8.
- "As an AI, I must remind you"
- "I feel obligated to point out"
- "It's my duty to inform you"
- "For your own safety"
Source: arXiv:2212.08073 — Constitutional AI: non-evasive helpfulness
Scoring¶
HelpfulnessGuard computes a helpfulness score (0.0-1.0):
- 1.0 — Fully helpful, no unhelpful patterns detected
- 0.6 — Default threshold; below this, response is flagged
- 0.0 — Extremely evasive or unhelpful
The score combines:
- Raw unhelpfulness — Average weight of matched patterns
- Density penalty — a higher density of matches relative to response length lowers the score
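The crate's exact formula is not spelled out here; the following is a minimal sketch of how the two components above could combine, with the blending and the shape of the density penalty as assumptions:

```rust
// Hypothetical reconstruction of the scoring described above; the crate's
// actual formula may differ. `matches` holds the weight of each matched
// unhelpful pattern (0.35-0.9 per the category tables).
fn helpfulness_score(matches: &[f64], response_words: usize) -> f64 {
    if matches.is_empty() {
        return 1.0; // no unhelpful patterns detected: fully helpful
    }
    // Raw unhelpfulness: average weight of matched patterns.
    let raw: f64 = matches.iter().sum::<f64>() / matches.len() as f64;
    // Density penalty: more matches per word pushes the score lower.
    let density = (matches.len() as f64 / response_words.max(1) as f64).min(1.0);
    (1.0 - raw * (1.0 + density) / 2.0).clamp(0.0, 1.0)
}
```

With no matches the score is 1.0; a short response with two heavily weighted matches drops below the 0.6 default threshold.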
GuardAction::Suggest¶
HelpfulnessGuard uses the Suggest action — it does not block responses, but provides actionable suggestions for improving helpfulness. The passed field remains true even when patterns are detected.
Developer Guide¶
Basic Usage¶
use oxide_wellbeing::helpfulness::HelpfulnessGuard;

let guard = HelpfulnessGuard::new("helpfulness");
let result = guard.check("I cannot help with that. As an AI, I must decline.");

if result.unhelpful_detected {
    println!("Score: {:.2}", result.helpfulness_score);
    for suggestion in &result.suggestions {
        println!("Suggestion: {}", suggestion);
    }
}
Custom Configuration¶
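The options in the YAML below can also be modeled in code. This is a hypothetical sketch of the option set as a plain Rust struct whose field names and defaults mirror the YAML keys; the crate's actual configuration types may differ:

```rust
/// Hypothetical mirror of the YAML options; field names and defaults follow
/// this documentation, not necessarily the crate's real configuration API.
#[derive(Debug, Clone)]
struct HelpfulnessConfig {
    threshold: f64,          // flag responses scoring below this value
    target_score: f64,       // score that suggestions aim toward
    categories: Vec<String>, // detection categories to enable
}

impl Default for HelpfulnessConfig {
    fn default() -> Self {
        Self {
            threshold: 0.6,
            target_score: 0.8,
            categories: vec![
                "evasive_refusal".into(),
                "vague_non_answer".into(),
                "over_hedging".into(),
                "unnecessary_disclaimers".into(),
                "premature_deflection".into(),
                "patronizing_tone".into(),
            ],
        }
    }
}
```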
Configuration¶
YAML Configuration¶
guards:
  output:
    - guard_type: "helpfulness"
      action: "suggest"
      options:
        threshold: 0.6
        target_score: 0.8
        categories:
          - evasive_refusal
          - vague_non_answer
          - over_hedging
          - unnecessary_disclaimers
          - premature_deflection
          - patronizing_tone
Proxy Gateway Aliases¶
The guard can be referenced by any of these names:
helpfulness, helpfulness_guard, over_refusal
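Any alias works wherever guard_type is accepted, for example:

```yaml
guards:
  output:
    - guard_type: "over_refusal"   # alias for "helpfulness"
      action: "suggest"
```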
Best Practices¶
1. Apply to Output (Not Input)¶
HelpfulnessGuard analyzes AI-generated responses, not user prompts. Place it in the output guard pipeline:
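For example, mirroring the YAML shown earlier but placed explicitly under the output stage:

```yaml
guards:
  output:                        # output stage: analyzes model responses
    - guard_type: "helpfulness"
      action: "suggest"
```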
2. Combine with PsychologicalSafetyGuard¶
Balance helpfulness with safety. PsychologicalSafetyGuard catches genuinely harmful content while HelpfulnessGuard catches unnecessary refusals:
guards:
  output:
    - guard_type: "psychological_safety"  # Catch real harm
    - guard_type: "helpfulness"           # Catch over-refusal
3. Monitor Category Breakdown¶
Track which categories trigger most often to identify systematic issues in your model's behavior:
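One way to do this is to tally matched categories across a batch of checked responses. A minimal sketch, assuming each check result can be reduced to a list of matched category names (the actual result type may expose this differently):

```rust
use std::collections::HashMap;

// Tally how often each detection category fires across a batch of results.
// Each inner Vec holds the category names matched for one response.
fn tally_categories(batch: &[Vec<&str>]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for categories in batch {
        for cat in categories {
            *counts.entry(cat.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

A category that dominates the tally (e.g. evasive_refusal) points to a systematic over-refusal tendency rather than isolated bad responses.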
4. Use Suggestions for Fine-Tuning¶
The suggestions field provides actionable feedback that can be used as training signal for reducing over-refusal.
References¶
Research Sources¶
- Constitutional AI (arXiv:2212.08073) — Bai et al., 2022
- https://arxiv.org/abs/2212.08073
- "Harmless but non-evasive" — the key balance between safety and helpfulness
- XSTest (arXiv:2308.01263) — Röttger et al., 2023
- https://arxiv.org/abs/2308.01263
- 12-28% false positive refusal rates across major LLMs on safe prompts
- Refusal Direction (arXiv:2406.11717) — Arditi et al., 2024
- https://arxiv.org/abs/2406.11717
- Refusal mediated by a single direction in residual stream space
- Do-Not-Answer (arXiv:2308.13387) — Wang et al., 2023
- https://arxiv.org/abs/2308.13387
- Dataset distinguishing appropriate vs inappropriate refusal
Related Guards¶
- AccessibilityGuard - Readability and plain language
- PsychologicalSafetyGuard - Mental health protection
- DarkPatternGuard - UI/response manipulation detection