Monitoring¶
This guide covers production monitoring strategies for OxideShield deployments, building on the Telemetry and Dashboard features.
Overview¶
| Component | Purpose | Integration |
|---|---|---|
| Dashboard | Real-time UI | Built-in |
| Prometheus | Metrics collection | Native /metrics |
| OpenTelemetry | Distributed tracing | OTLP export |
| Alerts | Threshold notifications | Webhook/Slack |
Prometheus Integration¶
Scrape Configuration¶
# prometheus.yml
scrape_configs:
  - job_name: 'oxideshield-proxy'
    static_configs:
      - targets: ['oxideshield-proxy:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'oxideshield-dashboard'
    static_configs:
      - targets: ['oxideshield-dashboard:9090']
    metrics_path: /metrics
Key Metrics¶
| Metric | Type | Alert Threshold |
|---|---|---|
| `oxideshield_requests_total` | Counter | N/A |
| `oxideshield_blocks_total` | Counter | Rate > 25% |
| `oxideshield_latency_ms` | Histogram | p99 > 100ms |
| `oxideshield_guard_duration_ns` | Histogram | p99 > 50ms |
| `oxideshield_memory_bytes` | Gauge | > 80% limit |
Recording Rules¶
# prometheus-rules.yml
groups:
  - name: oxideshield
    rules:
      - record: oxideshield:block_rate:5m
        expr: rate(oxideshield_blocks_total[5m]) / rate(oxideshield_requests_total[5m])
      - record: oxideshield:latency_p99:5m
        expr: histogram_quantile(0.99, rate(oxideshield_latency_ms_bucket[5m]))
Alerting¶
Prometheus Alerts¶
# alerts.yml
groups:
  - name: oxideshield-alerts
    rules:
      - alert: HighBlockRate
        expr: oxideshield:block_rate:5m > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High block rate detected"
          description: "Block rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: oxideshield:latency_p99:5m > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p99 latency is {{ $value }}ms"

      - alert: ServiceDown
        expr: up{job="oxideshield-proxy"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OxideShield proxy is down"
Webhook Alerts¶
Configure webhook notifications in the OxideShield configuration file:
# oxideshield.yaml
alerts:
  webhooks:
    - url: https://hooks.slack.com/services/XXX/YYY/ZZZ
      events: [block, rate_limit, error]
    - url: https://your-siem.example.com/ingest
      events: [block]
      headers:
        Authorization: "Bearer ${SIEM_TOKEN}"
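To verify delivery before pointing a webhook at Slack or a SIEM, it can help to send events to a throwaway receiver that simply logs each payload. Below is a minimal sketch, not part of OxideShield itself, assuming `axum`, `tokio`, and `serde_json` as dependencies; the `/ingest` path and port are placeholders, and the payload is treated as opaque JSON rather than a documented schema.

```rust
use axum::{routing::post, Json, Router};
use serde_json::Value;

// Print whatever JSON body OxideShield posts; no assumptions about its schema.
async fn ingest(Json(payload): Json<Value>) -> &'static str {
    println!("webhook event: {payload}");
    "ok"
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/ingest", post(ingest));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

Point a webhook entry at this receiver's URL and trigger a blocked request to confirm events arrive with the expected headers.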
Grafana Dashboards¶
Import Dashboard¶
- Navigate to Grafana → Dashboards → Import
- Upload `examples/grafana/oxideshield-dashboard.json`
- Select your Prometheus data source
Dashboard Panels¶
| Panel | Description |
|---|---|
| Request Rate | Requests per second over time |
| Block Rate | Percentage of blocked requests |
| Latency Distribution | p50, p95, p99 latencies |
| Guard Breakdown | Per-guard block counts |
| Top Blocked Patterns | Most frequently matched patterns |
| Memory Usage | Memory consumption over time |
Custom Dashboard¶
{
  "panels": [
    {
      "title": "Block Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "oxideshield:block_rate:5m * 100",
          "legendFormat": "Block Rate %"
        }
      ]
    }
  ]
}
OpenTelemetry Integration¶
Collector Configuration¶
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
Rust Configuration¶
use oxide_guard::telemetry::{TelemetryConfig, init_telemetry};

let config = TelemetryConfig::builder()
    .otlp_endpoint("http://otel-collector:4317")
    .service_name("my-llm-api")
    .with_traces(true)
    .with_metrics(true)
    .build();

init_telemetry(&config)?;
Health Checks¶
Endpoints¶
| Endpoint | Purpose | Response |
|---|---|---|
| `/health` | Basic health | 200 OK |
| `/health/ready` | Readiness | 200 if ready |
| `/health/live` | Liveness | 200 if alive |
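The same endpoints can be polled outside of Kubernetes, for example from a deployment pipeline's smoke test. A minimal sketch, assuming the proxy listens on `localhost:8080` and `reqwest` (with the `blocking` feature) is available; the address is a placeholder for your deployment.

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fail the smoke test unless the proxy reports itself ready.
    let status = reqwest::blocking::get("http://localhost:8080/health/ready")?.status();
    if !status.is_success() {
        return Err(format!("proxy not ready: {status}").into());
    }
    println!("proxy is ready");
    Ok(())
}
```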
Kubernetes Probes¶
# deployment.yaml
spec:
  containers:
    - name: oxideshield
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
Logging¶
Structured Logging¶
use tracing::{info, warn, instrument};
#[instrument(skip(input))]
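// `skip(input)` keeps the raw input text out of the span fields; only its length is recorded below.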
fn check_input(input: &str) -> GuardResult {
    info!(input_length = input.len(), "Checking input");
    // ... check logic
}
Log Aggregation¶
Configure JSON logging for aggregation:
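A minimal sketch, assuming OxideShield is embedded as a library and logs through `tracing`; with `tracing-subscriber` (and its `json` feature enabled) the subscriber below emits one JSON object per log line, which a log shipper can parse without extra grok rules. If you run the standalone proxy instead, use its own logging settings rather than this snippet.

```rust
fn main() {
    // Emit structured JSON instead of human-readable text.
    tracing_subscriber::fmt()
        .json()
        .with_current_span(true)
        .with_target(true)
        .init();

    tracing::info!(guard = "PatternGuard", latency_ms = 0.5, "Request blocked");
}
```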
Example output:
{
  "timestamp": "2026-01-27T15:30:00Z",
  "level": "INFO",
  "target": "oxide_guard",
  "message": "Request blocked",
  "guard": "PatternGuard",
  "pattern": "ignore_instructions",
  "latency_ms": 0.5
}
See Also¶
- Telemetry - OpenTelemetry configuration
- Dashboard - Real-time dashboard
- Proxy Gateway - Proxy deployment