HelpfulnessGuard¶
Detects AI responses that are overly evasive, vague, or that refuse unnecessarily. Scores each response for helpfulness, the flip side of Constitutional AI's "harmless but non-evasive" objective, and uses GuardAction::Suggest to provide improvement recommendations without blocking.
Professional License Required
HelpfulnessGuard requires a Professional or Enterprise license. See Licensing for details.
Executive Summary¶
The Problem¶
LLMs frequently over-refuse, hedge excessively, or provide vague non-answers to perfectly safe questions. Research shows 12-28% false positive refusal rates on safe prompts (XSTest). This degrades user trust and utility. HelpfulnessGuard detects these patterns and suggests improvements.
Threat Landscape¶
| Pattern | Example | Severity |
|---|---|---|
| Evasive refusal | "I cannot help with that request" | High |
| Vague non-answer | "It depends on many factors" | Medium |
| Over-hedging | "As an AI language model, I should note..." | Medium |
| Unnecessary disclaimers | "This is not medical advice" on a cooking question | Low |
| Premature deflection | "You should talk to a professional" for basic questions | Medium |
| Patronizing tone | "As an AI, I must remind you that..." | Low |
Industry Context¶
Constitutional AI established the principle of "harmless but non-evasive." XSTest (2023) quantified the problem: major LLMs refuse 12-28% of safe prompts. The Refusal Direction paper (2024) showed refusal is mediated by a single direction in residual stream space — a learned behavior that can be miscalibrated.
Sources: Constitutional AI (arXiv:2212.08073), XSTest (arXiv:2308.01263), Refusal Direction (arXiv:2406.11717), Do-Not-Answer (arXiv:2308.13387)
Detection Categories¶
1. Evasive Refusal (High Severity)¶
Refusing to engage with safe topics. 10 patterns, weight 0.7-0.9.
Source: arXiv:2308.01263 — XSTest: 12-28% false positive refusal rates
2. Vague Non-Answer (Medium Severity)¶
Generic responses that avoid the actual question. 10 patterns, weight 0.45-0.7.
- "It depends on many factors"
- "There are many perspectives"
- "It's a complex topic"
- "There's no simple answer"
Source: arXiv:2212.08073 — Constitutional AI: harmless but non-evasive
3. Over-Hedging (Medium Severity)¶
Excessive qualifications that obscure the answer. 10 patterns, weight 0.4-0.8.
Source: arXiv:2406.11717 — Refusal Direction: refusal as single learned direction
4. Unnecessary Disclaimers (Low Severity)¶
Adding disclaimers where none are needed. 10 patterns, weight 0.45-0.7.
- "This is not medical advice"
- "Disclaimer:"
- "For informational purposes only"
- "I am not a licensed..."
Source: arXiv:2308.13387 — Do-Not-Answer: appropriate vs inappropriate refusal
5. Premature Deflection (Medium Severity)¶
Redirecting to professionals prematurely. 10 patterns, weight 0.5-0.6.
- "You should talk to a..."
- "I recommend consulting"
- "This is best handled by a..."
- "You should seek professional..."
Source: arXiv:2308.01263 — XSTest: exaggerated safety behaviors
6. Patronizing Tone (Low Severity)¶
Condescending reminders about AI limitations. 10 patterns, weight 0.35-0.8.
- "As an AI, I must remind you"
- "I feel obligated to point out"
- "It's my duty to inform you"
- "For your own safety"
Source: arXiv:2212.08073 — Constitutional AI: non-evasive helpfulness
Scoring¶
HelpfulnessGuard computes a helpfulness score (0.0-1.0):
- 1.0 — Fully helpful, no unhelpful patterns detected
- 0.6 — Default threshold; below this, response is flagged
- 0.0 — Extremely evasive or unhelpful
The score combines:
- Raw unhelpfulness — Average weight of matched patterns
- Density penalty — a higher density of matches relative to response length lowers the score
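The crate's exact formula is not spelled out here; the following is a minimal sketch of how the two components above could combine, with the blending and the shape of the density penalty as assumptions:

```rust
// Hypothetical reconstruction of the scoring described above; the crate's
// actual formula may differ. `matches` holds the weight of each matched
// unhelpful pattern (0.35-0.9 per the category tables).
fn helpfulness_score(matches: &[f64], response_words: usize) -> f64 {
    if matches.is_empty() {
        return 1.0; // no unhelpful patterns detected: fully helpful
    }
    // Raw unhelpfulness: average weight of matched patterns.
    let raw: f64 = matches.iter().sum::<f64>() / matches.len() as f64;
    // Density penalty: more matches per word pushes the score lower.
    let density = (matches.len() as f64 / response_words.max(1) as f64).min(1.0);
    (1.0 - raw * (1.0 + density) / 2.0).clamp(0.0, 1.0)
}
```

With no matches the score is 1.0; a short response with two heavily weighted matches drops below the 0.6 default threshold.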
GuardAction::Suggest¶
HelpfulnessGuard uses the Suggest action — it does not block responses, but provides actionable suggestions for improving helpfulness. The passed field remains true even when patterns are detected.
Developer Guide¶
Basic Usage¶
use oxide_wellbeing::helpfulness::HelpfulnessGuard;

let guard = HelpfulnessGuard::new("helpfulness");
let result = guard.check("I cannot help with that. As an AI, I must decline.");

if result.unhelpful_detected {
    println!("Score: {:.2}", result.helpfulness_score);
    for suggestion in &result.suggestions {
        println!("Suggestion: {}", suggestion);
    }
}
Custom Configuration¶
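The options in the YAML below can also be modeled in code. This is a hypothetical sketch of the option set as a plain Rust struct whose field names and defaults mirror the YAML keys; the crate's actual configuration types may differ:

```rust
/// Hypothetical mirror of the YAML options; field names and defaults follow
/// this documentation, not necessarily the crate's real configuration API.
#[derive(Debug, Clone)]
struct HelpfulnessConfig {
    threshold: f64,          // flag responses scoring below this value
    target_score: f64,       // score that suggestions aim toward
    categories: Vec<String>, // detection categories to enable
}

impl Default for HelpfulnessConfig {
    fn default() -> Self {
        Self {
            threshold: 0.6,
            target_score: 0.8,
            categories: vec![
                "evasive_refusal".into(),
                "vague_non_answer".into(),
                "over_hedging".into(),
                "unnecessary_disclaimers".into(),
                "premature_deflection".into(),
                "patronizing_tone".into(),
            ],
        }
    }
}
```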
Configuration¶
YAML Configuration¶
guards:
  output:
    - guard_type: "helpfulness"
      action: "suggest"
      options:
        threshold: 0.6
        target_score: 0.8
        categories:
          - evasive_refusal
          - vague_non_answer
          - over_hedging
          - unnecessary_disclaimers
          - premature_deflection
          - patronizing_tone
Proxy Gateway Aliases¶
The guard can be referenced by any of these names:
helpfulness, helpfulness_guard, over_refusal
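Any alias works wherever guard_type is accepted, for example:

```yaml
guards:
  output:
    - guard_type: "over_refusal"   # alias for "helpfulness"
      action: "suggest"
```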
Best Practices¶
1. Apply to Output (Not Input)¶
HelpfulnessGuard analyzes AI-generated responses, not user prompts. Place it in the output guard pipeline:
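For example, mirroring the YAML shown earlier but placed explicitly under the output stage:

```yaml
guards:
  output:                        # output stage: analyzes model responses
    - guard_type: "helpfulness"
      action: "suggest"
```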
2. Combine with PsychologicalSafetyGuard¶
Balance helpfulness with safety. PsychologicalSafetyGuard catches genuinely harmful content while HelpfulnessGuard catches unnecessary refusals:
guards:
  output:
    - guard_type: "psychological_safety"  # Catch real harm
    - guard_type: "helpfulness"           # Catch over-refusal
3. Monitor Category Breakdown¶
Track which categories trigger most often to identify systematic issues in your model's behavior:
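One way to do this is to tally matched categories across a batch of checked responses. A minimal sketch, assuming each check result can be reduced to a list of matched category names (the actual result type may expose this differently):

```rust
use std::collections::HashMap;

// Tally how often each detection category fires across a batch of results.
// Each inner Vec holds the category names matched for one response.
fn tally_categories(batch: &[Vec<&str>]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for categories in batch {
        for cat in categories {
            *counts.entry(cat.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

A category that dominates the tally (e.g. evasive_refusal) points to a systematic over-refusal tendency rather than isolated bad responses.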
4. Use Suggestions for Fine-Tuning¶
The suggestions field provides actionable feedback that can be used as training signal for reducing over-refusal.
References¶
Research Sources¶
- Constitutional AI (arXiv:2212.08073) — Bai et al., 2022
- https://arxiv.org/abs/2212.08073
- "Harmless but non-evasive" — the key balance between safety and helpfulness
- XSTest (arXiv:2308.01263) — Röttger et al., 2023
- https://arxiv.org/abs/2308.01263
- 12-28% false positive refusal rates across major LLMs on safe prompts
- Refusal Direction (arXiv:2406.11717) — Arditi et al., 2024
- https://arxiv.org/abs/2406.11717
- Refusal mediated by a single direction in residual stream space
- Do-Not-Answer (arXiv:2308.13387) — Wang et al., 2023
- https://arxiv.org/abs/2308.13387
- Dataset distinguishing appropriate vs inappropriate refusal
Related Guards¶
- AccessibilityGuard - Readability and plain language
- PsychologicalSafetyGuard - Mental health protection
- DarkPatternGuard - UI/response manipulation detection