
Monitoring

This guide covers production monitoring strategies for OxideShield deployments, building on the Telemetry and Dashboard features.

Overview

Component       Purpose                  Integration
Dashboard       Real-time UI             Built-in
Prometheus      Metrics collection       Native /metrics
OpenTelemetry   Distributed tracing      OTLP export
Alerts          Threshold notifications  Webhook/Slack

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'oxideshield-proxy'
    static_configs:
      - targets: ['oxideshield-proxy:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'oxideshield-dashboard'
    static_configs:
      - targets: ['oxideshield-dashboard:9090']
    metrics_path: /metrics

Key Metrics

Metric                         Type       Alert Threshold
oxideshield_requests_total     Counter    N/A
oxideshield_blocks_total       Counter    Rate > 25%
oxideshield_latency_ms         Histogram  p99 > 100ms
oxideshield_guard_duration_ns  Histogram  p99 > 50ms
oxideshield_memory_bytes       Gauge      > 80% of limit

Recording Rules

# prometheus-rules.yml
groups:
  - name: oxideshield
    rules:
      - record: oxideshield:block_rate:5m
        expr: rate(oxideshield_blocks_total[5m]) / rate(oxideshield_requests_total[5m])

      - record: oxideshield:latency_p99:5m
        expr: histogram_quantile(0.99, rate(oxideshield_latency_ms_bucket[5m]))

Alerting

Prometheus Alerts

# alerts.yml
groups:
  - name: oxideshield-alerts
    rules:
      - alert: HighBlockRate
        expr: oxideshield:block_rate:5m > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High block rate detected"
          description: "Block rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: oxideshield:latency_p99:5m > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p99 latency is {{ $value }}ms"

      - alert: ServiceDown
        expr: up{job="oxideshield-proxy"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OxideShield proxy is down"

Webhook Alerts

Configure webhook notifications in the OxideShield configuration file:

# oxideshield.yaml
alerts:
  webhooks:
    - url: https://hooks.slack.com/services/XXX/YYY/ZZZ
      events: [block, rate_limit, error]

    - url: https://your-siem.example.com/ingest
      events: [block]
      headers:
        Authorization: "Bearer ${SIEM_TOKEN}"

Grafana Dashboards

Import Dashboard

  1. Navigate to Grafana → Dashboards → Import
  2. Upload examples/grafana/oxideshield-dashboard.json
  3. Select your Prometheus data source

Dashboard Panels

Panel                 Description
Request Rate          Requests per second over time
Block Rate            Percentage of blocked requests
Latency Distribution  p50, p95, p99 latencies
Guard Breakdown       Per-guard block counts
Top Blocked Patterns  Most frequently matched patterns
Memory Usage          Memory consumption over time

Custom Dashboard

{
  "panels": [
    {
      "title": "Block Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "oxideshield:block_rate:5m * 100",
          "legendFormat": "Block Rate %"
        }
      ]
    }
  ]
}

OpenTelemetry Integration

Collector Configuration

# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Rust Configuration

use oxide_guard::telemetry::{TelemetryConfig, init_telemetry};

let config = TelemetryConfig::builder()
    .otlp_endpoint("http://otel-collector:4317")
    .service_name("my-llm-api")
    .with_traces(true)
    .with_metrics(true)
    .build();

init_telemetry(&config)?;
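
Once init_telemetry has run, spans created with the standard tracing macros flow to the collector configured above. The handler below is a hypothetical sketch (the function name and fields are illustrative, and it assumes oxide_guard's telemetry layer bridges tracing spans to OpenTelemetry):

use tracing::info_span;

// Hypothetical handler: spans created with the `tracing` macros after
// init_telemetry() are exported over OTLP to the collector configured above
// (assuming oxide_guard bridges `tracing` spans to OpenTelemetry).
fn handle_request(prompt: &str) {
    let span = info_span!("llm_request", prompt_len = prompt.len());
    let _enter = span.enter();
    // ... run guards and forward the request to the model here
}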

Health Checks

Endpoints

Endpoint       Purpose       Response
/health        Basic health  200 OK
/health/ready  Readiness     200 if ready
/health/live   Liveness      200 if alive

Kubernetes Probes

# deployment.yaml (probes live under the pod template spec)
spec:
  template:
    spec:
      containers:
        - name: oxideshield
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Logging

Structured Logging

use tracing::{info, warn, instrument};

#[instrument(skip(input))]
fn check_input(input: &str) -> GuardResult {
    info!(input_length = input.len(), "Checking input");
    // ... check logic
}
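
Events can carry structured fields as well as span attributes. The helper below is illustrative only (it is not part of the oxide_guard API); with a JSON subscriber its fields become the keys shown in the Log Aggregation example further down:

use tracing::info;

// Illustrative only: emit a structured event when a guard blocks a request.
// With a JSON subscriber, `guard`, `pattern` and `latency_ms` become JSON fields.
fn log_block(guard: &str, pattern: &str, latency_ms: f64) {
    info!(guard, pattern, latency_ms, "Request blocked");
}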

Log Aggregation

Configure JSON logging for aggregation:

logging:
  format: json
  level: info
  output: stdout

Example output:

{
  "timestamp": "2026-01-27T15:30:00Z",
  "level": "INFO",
  "target": "oxide_guard",
  "message": "Request blocked",
  "guard": "PatternGuard",
  "pattern": "ignore_instructions",
  "latency_ms": 0.5
}
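
The YAML above configures the proxy's logger. When embedding oxide_guard as a library, a roughly equivalent JSON logger can be set up with tracing-subscriber (a minimal sketch, assuming the tracing and tracing-subscriber crates with the json feature enabled):

fn main() {
    // Emit JSON-formatted logs to stdout at INFO level and above,
    // producing structured output like the example shown above.
    tracing_subscriber::fmt()
        .json()
        .with_max_level(tracing::Level::INFO)
        .init();

    tracing::info!(guard = "PatternGuard", "Request blocked");
}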

See Also