
Monitoring

This guide covers production monitoring strategies for OxideShield deployments, building on the Telemetry and Dashboard features.

Overview

Component       Purpose                  Integration
Dashboard       Real-time UI             Built-in
Prometheus      Metrics collection       Native /metrics
OpenTelemetry   Distributed tracing      OTLP export
Alerts          Threshold notifications  Webhook/Slack

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'oxideshield-proxy'
    static_configs:
      - targets: ['oxideshield-proxy:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'oxideshield-dashboard'
    static_configs:
      - targets: ['oxideshield-dashboard:9090']
    metrics_path: /metrics

Key Metrics

Metric                         Type       Alert Threshold
oxideshield_requests_total     Counter    N/A
oxideshield_blocks_total       Counter    Rate > 25%
oxideshield_latency_ms         Histogram  p99 > 100ms
oxideshield_guard_duration_ns  Histogram  p99 > 50ms
oxideshield_memory_bytes       Gauge      > 80% of limit

Recording Rules

# prometheus-rules.yml
groups:
  - name: oxideshield
    rules:
      - record: oxideshield:block_rate:5m
        expr: rate(oxideshield_blocks_total[5m]) / rate(oxideshield_requests_total[5m])

      - record: oxideshield:latency_p99:5m
        expr: histogram_quantile(0.99, rate(oxideshield_latency_ms_bucket[5m]))

Alerting

Prometheus Alerts

# alerts.yml
groups:
  - name: oxideshield-alerts
    rules:
      - alert: HighBlockRate
        expr: oxideshield:block_rate:5m > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High block rate detected"
          description: "Block rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: oxideshield:latency_p99:5m > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p99 latency is {{ $value }}ms"

      - alert: ServiceDown
        expr: up{job="oxideshield-proxy"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OxideShield proxy is down"

Webhook Alerts

Configure webhook notifications in the OxideShield configuration file:

# oxideshield.yaml
alerts:
  webhooks:
    - url: https://hooks.slack.com/services/XXX/YYY/ZZZ
      events: [block, rate_limit, error]

    - url: https://your-siem.example.com/ingest
      events: [block]
      headers:
        Authorization: "Bearer ${SIEM_TOKEN}"

Grafana Dashboards

Import Dashboard

  1. Navigate to Grafana → Dashboards → Import
  2. Upload examples/grafana/oxideshield-dashboard.json
  3. Select your Prometheus data source

Dashboard Panels

Panel                 Description
Request Rate          Requests per second over time
Block Rate            Percentage of blocked requests
Latency Distribution  p50, p95, p99 latencies
Guard Breakdown       Per-guard block counts
Top Blocked Patterns  Most frequently matched patterns
Memory Usage          Memory consumption over time

Custom Dashboard

{
  "panels": [
    {
      "title": "Block Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "oxideshield:block_rate:5m * 100",
          "legendFormat": "Block Rate %"
        }
      ]
    }
  ]
}

OpenTelemetry Integration

Collector Configuration

# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Rust Configuration

use oxide_guard::telemetry::{TelemetryConfig, init_telemetry};

let config = TelemetryConfig::builder()
    .otlp_endpoint("http://otel-collector:4317")
    .service_name("my-llm-api")
    .with_traces(true)
    .with_metrics(true)
    .build();

init_telemetry(&config)?;
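
Once init_telemetry has run, spans created with the standard tracing macros flow to the collector configured above. The handler below is a hypothetical sketch (the function name and fields are illustrative, and it assumes oxide_guard's telemetry layer bridges tracing spans to OpenTelemetry):

use tracing::info_span;

// Hypothetical handler: spans created with the `tracing` macros after
// init_telemetry() are exported over OTLP to the collector configured above
// (assuming oxide_guard bridges `tracing` spans to OpenTelemetry).
fn handle_request(prompt: &str) {
    let span = info_span!("llm_request", prompt_len = prompt.len());
    let _enter = span.enter();
    // ... run guards and forward the request to the model here
}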

Health Checks

Endpoints

Endpoint       Purpose       Response
/health        Basic health  200 OK
/health/ready  Readiness     200 if ready
/health/live   Liveness      200 if alive

Kubernetes Probes

# deployment.yaml (probes live under the pod template spec)
spec:
  template:
    spec:
      containers:
        - name: oxideshield
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Logging

Structured Logging

use tracing::{info, warn, instrument};

#[instrument(skip(input))]
fn check_input(input: &str) -> GuardResult {
    info!(input_length = input.len(), "Checking input");
    // ... check logic
}
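
Events can carry structured fields as well as span attributes. The helper below is illustrative only (it is not part of the oxide_guard API); with a JSON subscriber its fields become the keys shown in the Log Aggregation example further down:

use tracing::info;

// Illustrative only: emit a structured event when a guard blocks a request.
// With a JSON subscriber, `guard`, `pattern` and `latency_ms` become JSON fields.
fn log_block(guard: &str, pattern: &str, latency_ms: f64) {
    info!(guard, pattern, latency_ms, "Request blocked");
}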

Log Aggregation

Configure JSON logging for aggregation:

logging:
  format: json
  level: info
  output: stdout

Example output:

{
  "timestamp": "2026-01-27T15:30:00Z",
  "level": "INFO",
  "target": "oxide_guard",
  "message": "Request blocked",
  "guard": "PatternGuard",
  "pattern": "ignore_instructions",
  "latency_ms": 0.5
}
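
The YAML above configures the proxy's logger. When embedding oxide_guard as a library, a roughly equivalent JSON logger can be set up with tracing-subscriber (a minimal sketch, assuming the tracing and tracing-subscriber crates with the json feature enabled):

fn main() {
    // Emit JSON-formatted logs to stdout at INFO level and above,
    // producing structured output like the example shown above.
    tracing_subscriber::fmt()
        .json()
        .with_max_level(tracing::Level::INFO)
        .init();

    tracing::info!(guard = "PatternGuard", "Request blocked");
}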

See Also