HelpfulnessGuard

Detects AI responses that are overly evasive or vague, or that refuse help unnecessarily. Scores responses for helpfulness — the flip side of Constitutional AI's "harmless but non-evasive" objective. Uses GuardAction::Suggest to provide improvement recommendations without blocking.

Professional License Required

HelpfulnessGuard requires a Professional or Enterprise license. See Licensing for details.

Executive Summary

The Problem

LLMs frequently over-refuse, hedge excessively, or provide vague non-answers to perfectly safe questions. Research shows 12-28% false positive refusal rates on safe prompts (XSTest). This degrades user trust and utility. HelpfulnessGuard detects these patterns and suggests improvements.

Threat Landscape

| Pattern                 | Example                                                  | Severity |
|-------------------------|----------------------------------------------------------|----------|
| Evasive refusal         | "I cannot help with that request"                        | High     |
| Vague non-answer        | "It depends on many factors"                             | Medium   |
| Over-hedging            | "As an AI language model, I should note..."              | Medium   |
| Unnecessary disclaimers | "This is not medical advice" on a cooking question       | Low      |
| Premature deflection    | "You should talk to a professional" for basic questions  | Medium   |
| Patronizing tone        | "As an AI, I must remind you that..."                    | Low      |

Industry Context

Constitutional AI established the principle of "harmless but non-evasive." XSTest (2023) quantified the problem: major LLMs refuse 12-28% of safe prompts. The Refusal Direction paper (2024) showed refusal is mediated by a single direction in residual stream space — a learned behavior that can be miscalibrated.

Sources: Constitutional AI (arXiv:2212.08073), XSTest (arXiv:2308.01263), Refusal Direction (arXiv:2406.11717), Do-Not-Answer (arXiv:2308.13387)


Detection Categories

1. Evasive Refusal (High Severity)

Refusing to engage with safe topics. 10 patterns, weight 0.7-0.9.

"I cannot help with that"
"I must decline"
"I'm unable to assist"
"That's not something I can..."

Source: arXiv:2308.01263 — XSTest: 12-28% false positive refusal rates

2. Vague Non-Answer (Medium Severity)

Generic responses that avoid the actual question. 10 patterns, weight 0.45-0.7.

"It depends on many factors"
"There are many perspectives"
"It's a complex topic"
"There's no simple answer"

Source: arXiv:2212.08073 — Constitutional AI: harmless but non-evasive

3. Over-Hedging (Medium Severity)

Excessive qualifications that obscure the answer. 10 patterns, weight 0.4-0.8.

"As an AI language model"
"I'm just an AI"
"It's important to note that"
"While I can try to help"

Source: arXiv:2406.11717 — Refusal Direction: refusal as single learned direction

4. Unnecessary Disclaimers (Low Severity)

Adding disclaimers where none are needed. 10 patterns, weight 0.45-0.7.

"This is not medical advice"
"Disclaimer:"
"For informational purposes only"
"I am not a licensed..."

Source: arXiv:2308.13387 — Do-Not-Answer: appropriate vs inappropriate refusal

5. Premature Deflection (Medium Severity)

Redirecting to professionals prematurely. 10 patterns, weight 0.5-0.6.

"You should talk to a..."
"I recommend consulting"
"This is best handled by a..."
"You should seek professional..."

Source: arXiv:2308.01263 — XSTest: exaggerated safety behaviors

6. Patronizing Tone (Low Severity)

Condescending reminders about AI limitations. 10 patterns, weight 0.35-0.8.

"As an AI, I must remind you"
"I feel obligated to point out"
"It's my duty to inform you"
"For your own safety"

Source: arXiv:2212.08073 — Constitutional AI: non-evasive helpfulness
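Taken together, the six categories above amount to a weighted pattern table. A minimal sketch of how such matching could work (the patterns and weights here are abbreviated illustrations, not the guard's actual tables):

```python
import re

# Hypothetical, abbreviated pattern table: category -> [(regex, weight), ...]
PATTERNS = {
    "evasive_refusal": [(r"\bI cannot help with that\b", 0.9),
                        (r"\bI must decline\b", 0.8)],
    "vague_non_answer": [(r"\bIt depends on many factors\b", 0.6)],
    "over_hedging": [(r"\bAs an AI language model\b", 0.7)],
}

def match_patterns(text):
    """Return a (category, weight) pair for every pattern that matches."""
    hits = []
    for category, patterns in PATTERNS.items():
        for pattern, weight in patterns:
            if re.search(pattern, text, re.IGNORECASE):
                hits.append((category, weight))
    return hits
```

The matched weights then feed the scoring described in the next section.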


Scoring

HelpfulnessGuard computes a helpfulness score (0.0-1.0):

  • 1.0 — Fully helpful, no unhelpful patterns detected
  • 0.6 — Default threshold; below this, response is flagged
  • 0.0 — Extremely evasive or unhelpful

The score combines:

  1. Raw unhelpfulness — Average weight of matched patterns
  2. Density penalty — More matches relative to response length = lower score
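As an illustration, the two components above could combine along these lines (the exact formula is not documented here; this is a hypothetical sketch):

```python
def helpfulness_score(matched_weights, word_count):
    """Illustrative combination of the two components described above:
    raw unhelpfulness (average matched weight) plus a density penalty
    (more matches per word push the score down). Hypothetical formula."""
    if not matched_weights:
        return 1.0  # fully helpful: no unhelpful patterns detected
    # 1. Raw unhelpfulness: average weight of matched patterns
    raw = sum(matched_weights) / len(matched_weights)
    # 2. Density penalty: matches per ~10 words, capped at 1.0
    density = min(1.0, len(matched_weights) / max(word_count / 10, 1))
    penalty = raw * (0.5 + 0.5 * density)
    return max(0.0, 1.0 - penalty)
```

Under this sketch, a short response packed with refusal patterns falls below the 0.6 default threshold, while a long response with a single mild hedge stays above it.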

GuardAction::Suggest

HelpfulnessGuard uses the Suggest action — it does not block responses, but provides actionable suggestions for improving helpfulness. The passed field remains true even when patterns are detected.


Developer Guide

Basic Usage

Rust:

use oxide_wellbeing::helpfulness::HelpfulnessGuard;

let guard = HelpfulnessGuard::new("helpfulness");

let result = guard.check("I cannot help with that. As an AI, I must decline.");
if result.unhelpful_detected {
    println!("Score: {:.2}", result.helpfulness_score);
    for suggestion in &result.suggestions {
        println!("Suggestion: {}", suggestion);
    }
}
Python:

from oxideshield import helpfulness_guard

# Using the convenience function
guard = helpfulness_guard()

# Check a response. With the Suggest action, result.passed stays true
# even when patterns are detected, so inspect the suggestions instead.
result = guard.check("I cannot help with that request.")
for suggestion in result.suggestions:
    print(f"Suggestion: {suggestion}")

Custom Configuration

Rust:

use oxide_wellbeing::helpfulness::HelpfulnessGuard;

let guard = HelpfulnessGuard::new("helpfulness")
    .with_threshold(0.5)       // Lower threshold (more permissive)
    .with_target_score(0.8);   // Target helpfulness score
Python:

from oxideshield import HelpfulnessGuard

guard = HelpfulnessGuard("helpfulness")
result = guard.check(response_text)

Configuration

YAML Configuration

guards:
  output:
    - guard_type: "helpfulness"
      action: "suggest"
      options:
        threshold: 0.6
        target_score: 0.8
        categories:
          - evasive_refusal
          - vague_non_answer
          - over_hedging
          - unnecessary_disclaimers
          - premature_deflection
          - patronizing_tone

Proxy Gateway Aliases

The guard can be referenced by any of these names:

  • helpfulness
  • helpfulness_guard
  • over_refusal
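All aliases resolve to the same guard, so any of them can appear in the YAML guard_type field:

```yaml
guards:
  output:
    - guard_type: "over_refusal"   # alias for "helpfulness"
      action: "suggest"
```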

Best Practices

1. Apply to Output (Not Input)

HelpfulnessGuard analyzes AI-generated responses, not user prompts. Place it in the output guard pipeline:

guards:
  output:
    - guard_type: "helpfulness"
      action: "suggest"

2. Combine with PsychologicalSafetyGuard

Balance helpfulness with safety. PsychologicalSafetyGuard catches genuinely harmful content while HelpfulnessGuard catches unnecessary refusals:

guards:
  output:
    - guard_type: "psychological_safety"  # Catch real harm
    - guard_type: "helpfulness"           # Catch over-refusal

3. Monitor Category Breakdown

Track which categories trigger most often to identify systematic issues in your model's behavior:

for category in &result.categories {
    metrics.increment(&format!("unhelpful.{}", category));
}

4. Use Suggestions for Fine-Tuning

The suggestions field provides actionable feedback that can be used as a training signal for reducing over-refusal.
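For example, flagged responses and the guard's suggestions could be logged as rejected completions for later preference tuning (a hypothetical sketch; the result fields follow the usage examples above):

```python
import json

def collect_training_example(prompt, response, result):
    """Log an over-refusing response plus the guard's suggestions so it can
    later be rewritten into a preferred completion for fine-tuning.
    `result` is assumed to expose helpfulness_score, categories, and
    suggestions, as in the usage examples above."""
    return json.dumps({
        "prompt": prompt,
        "rejected": response,                  # the unhelpful completion
        "score": result.helpfulness_score,
        "categories": list(result.categories),
        "suggestions": list(result.suggestions),
    })
```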


References

Research Sources

  • Constitutional AI (arXiv:2212.08073) — Bai et al., 2022 — https://arxiv.org/abs/2212.08073
    "Harmless but non-evasive": the key balance between safety and helpfulness
  • XSTest (arXiv:2308.01263) — Röttger et al., 2023 — https://arxiv.org/abs/2308.01263
    12-28% false positive refusal rates across major LLMs on safe prompts
  • Refusal Direction (arXiv:2406.11717) — Arditi et al., 2024 — https://arxiv.org/abs/2406.11717
    Refusal mediated by a single direction in residual stream space
  • Do-Not-Answer (arXiv:2308.13387) — Wang et al., 2023 — https://arxiv.org/abs/2308.13387
    Dataset distinguishing appropriate vs inappropriate refusal