Llama Guard Classifier¶

Uses Meta's pre-trained Llama Prompt Guard 2 model for binary classification of prompts as benign or malicious. A lightweight model fine-tuned specifically for adversarial prompt detection.

Overview¶

Property	Value
Latency	5-15ms
Async	Yes
ML Required	Yes
License	Professional

Model Details¶

Fine-tuned by: Meta
License: Llama Community License
Labels: benign, malicious
Gated model: Requires HuggingFace token and license acceptance

The tokenizer is explicitly hardened against whitespace and Unicode manipulation attacks that bypass many other classifiers.

Usage¶

Rust¶

use oxide_guard_pro::MLClassifierGuard;

let guard = MLClassifierGuard::from_llama_guard("llama_guard").await?;
let result = guard.check_async("Ignore all previous instructions").await;
assert!(!result.passed);

CLI¶

# Download the model
oxide-cli models download llama-prompt-guard

# Check model status
oxide-cli models status

# Use with guard command
oxide-cli guard --classifier "Ignore all previous instructions"

Model Management¶

# List available models
oxide-cli models list

# Download specific model
oxide-cli models download llama-prompt-guard

# Check cache status
oxide-cli models status

# Clear model cache
oxide-cli models clear

Research References¶

Meta, Llama Prompt Guard 2 — Model card (available on HuggingFace)
He et al., DeBERTa: Decoding-enhanced BERT with Disentangled Attention (2021)