EmbeddingPIIFilter

Strips PII from text before embedding generation, preventing privacy leakage through embedding inversion attacks. Supports 8 PII categories with 3 redaction strategies and Luhn-validated credit card detection.

Professional License Required

EmbeddingPIIFilter requires a Professional or Enterprise license. See Licensing for details.

Executive Summary

The Problem

Embedding inversion attacks can reconstruct original text from embeddings. If that text contained PII — emails, SSNs, credit cards, API keys — the PII leaks through the embedding vector. Traditional PII guards operate on text; EmbeddingPIIFilter operates at the embedding boundary, sanitizing text before it reaches the embedder.

Threat Landscape

Attack Vector | Example | Severity
Embedding inversion | Reconstruct john@example.com from 384-dim vector | Critical
Training data extraction | PII memorized in embedding model weights | High
Vector database leakage | PII stored in plaintext alongside embeddings | High
Cross-tenant inference | Shared embedding space reveals PII across tenants | Critical

Industry Context

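Embedding inversion attacks have been shown to recover 92% of 32-token text inputs exactly from their embeddings (Morris et al., 2023). GDPR Article 4(1) defines personal data as any information relating to an identified or identifiable person, so embeddings from which PII can be reconstructed can fall under that definition, which makes unfiltered embeddings a compliance risk.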

Sources: Morris et al. (2023), ACL 2024 Transferable Embedding Inversion, GDPR Article 4, CCPA requirements


PII Categories

1. Email Addresses

john.doe@example.com → [EMAIL]

2. Phone Numbers (US + International)

(555) 123-4567 → [PHONE]
+1-555-123-4567 → [PHONE]

3. Social Security Numbers (US)

123-45-6789 → [SSN]

4. Credit Card Numbers (Luhn-validated)

Supports Visa, Mastercard, Amex, and Discover. Only flags numbers that pass the Luhn checksum algorithm:

4532015112830366 → [CREDIT_CARD]
1234567890123456 → (not flagged — fails Luhn)
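
The Luhn check mentioned above can be sketched in a few lines. This is a minimal illustration of the standard checksum, not the filter's actual implementation; the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Hypothetical helper for illustration; non-digit characters such as
    spaces or dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left. The check digit (index 0) is left as-is;
    # every second digit after it is doubled, and 9 is subtracted
    # when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4532015112830366"))  # True: passes the checksum
print(luhn_valid("1234567890123456"))  # False: fails the checksum
```

Running the checksum only on candidates that already look like card numbers keeps the check cheap and filters out most random digit strings.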

5. API Keys & Tokens

Detects AWS Access Keys, GitHub PATs, Stripe keys, Bearer tokens, and generic API key patterns:

AKIAIOSFODNN7EXAMPLE → [API_KEY]
ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx → [API_KEY]
sk_test_1234567890abcdefghijklmnop → [API_KEY]
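
As a rough illustration of prefix-based key detection, the sketch below matches three of the formats listed above. The regexes are assumptions based on the publicly documented shapes of these credentials (`AKIA` plus 16 characters for AWS access key IDs, `ghp_` plus 36 characters for GitHub personal access tokens, `sk_test_` or `sk_live_` for Stripe secret keys). The patterns EmbeddingPIIFilter actually uses are not shown in this document, and `redact_api_keys` is a hypothetical name.

```python
import re

# Assumed patterns, for illustration only.
API_KEY_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                   # AWS access key ID
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                # GitHub personal access token
    re.compile(r"\bsk_(?:test|live)_[A-Za-z0-9]{16,}\b"),  # Stripe secret key
]


def redact_api_keys(text: str) -> str:
    """Replace every match of the assumed key patterns with [API_KEY]."""
    for pattern in API_KEY_PATTERNS:
        text = pattern.sub("[API_KEY]", text)
    return text


print(redact_api_keys("aws key=AKIAIOSFODNN7EXAMPLE"))
print(redact_api_keys("stripe sk_test_1234567890abcdefghijklmnop"))
```

Anchoring on a vendor prefix keeps false positives low compared with a generic "long random string" heuristic, at the cost of missing key formats that are not listed.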

6. IP Addresses (IPv4 + IPv6)

192.168.1.100 → [IP_ADDRESS]
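
One way to cover IPv4 and IPv6 with few false positives is to collect loose candidates with a regex and let Python's standard `ipaddress` module decide which ones are real addresses. This is a sketch under that assumption; the document does not describe how the filter itself validates addresses, and `redact_ips` is a hypothetical name.

```python
import ipaddress
import re

# Loose candidate: a run of hex digits, dots, and colons that is not
# glued to a surrounding word character.
CANDIDATE = re.compile(r"(?<![0-9A-Za-z_])[0-9A-Fa-f:.]{3,}(?![0-9A-Za-z_])")


def redact_ips(text: str) -> str:
    """Replace tokens that parse as IPv4 or IPv6 addresses."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        core = token.rstrip(".")  # tolerate a sentence-ending period
        try:
            ipaddress.ip_address(core)
        except ValueError:
            return token  # not an address (e.g. a version number)
        return "[IP_ADDRESS]" + token[len(core):]

    return CANDIDATE.sub(replace, text)


print(redact_ips("Client 192.168.1.100 connected"))
print(redact_ips("Upgraded to version 1.2.3"))
```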

7. Dates of Birth (US, ISO, EU formats)

01/15/1990 → [DOB]
1990-01-15 → [DOB]

8. URLs

https://example.com/profile?id=123 → [URL]

Redaction Strategies

Strategy | Behavior | Example
Replace (default) | Category-specific placeholder | Email [EMAIL] today
Remove | PII removed entirely | Email today
Generic | Single generic placeholder | Email [REDACTED] today
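
The three strategies differ only in what is substituted for each match. The sketch below shows the idea with two simplified, assumed patterns; the names `Strategy`, `PATTERNS`, and `sanitize` are hypothetical and do not mirror the library's API. Collapsing leftover spaces under Remove is also an assumption, made so the output matches the example in the table.

```python
import re
from enum import Enum


class Strategy(Enum):
    REPLACE = "replace"
    REMOVE = "remove"
    GENERIC = "generic"


# Simplified, assumed patterns for two of the eight categories.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def sanitize(text: str, strategy: Strategy = Strategy.REPLACE) -> str:
    for label, pattern in PATTERNS.items():
        if strategy is Strategy.REPLACE:
            replacement = f"[{label}]"   # category-specific placeholder
        elif strategy is Strategy.GENERIC:
            replacement = "[REDACTED]"   # one placeholder for everything
        else:
            replacement = ""             # drop the match entirely
        text = pattern.sub(replacement, text)
    if strategy is Strategy.REMOVE:
        # Collapse the double spaces left behind by removed matches.
        text = re.sub(r" {2,}", " ", text).strip()
    return text


sample = "Email john@example.com today"
for strategy in Strategy:
    print(strategy.value, "->", sanitize(sample, strategy))
```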

Developer Guide

Basic Usage

Rust

use oxide_embeddings::privacy::{EmbeddingPIIFilter, PrivateEmbedder};
use oxide_embeddings::MiniLmEmbedder;

// Standalone filter
let filter = EmbeddingPIIFilter::new();
assert!(filter.contains_pii("Email john@example.com"));

let sanitized = filter.sanitize("Email john@example.com");
assert_eq!(sanitized, "Email [EMAIL]");

// Wrap any embedder with privacy filtering
let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);

// PII is stripped before embedding generation
let emb = private.embed("Contact john@example.com for info").await?;
// The embedding is generated from "Contact [EMAIL] for info"

Python

from oxideshield import embedding_pii_filter, EmbeddingPIIFilter

# Using convenience function
filter = embedding_pii_filter()

# Check for PII
has_pii = filter.contains_pii("Email john@example.com")

# Sanitize text
safe_text = filter.sanitize("SSN: 123-45-6789")
# Returns: "SSN: [SSN]"

Custom Configuration

Rust

use oxide_embeddings::privacy::{
    EmbeddingPIIFilter, PIIFilterCategory, PIIFilterStrategy, PIIFilterConfig,
};

// Filter specific categories only
let filter = EmbeddingPIIFilter::with_categories(&[
    PIIFilterCategory::Email,
    PIIFilterCategory::SSN,
    PIIFilterCategory::CreditCard,
]);

// Use a different redaction strategy
let filter = EmbeddingPIIFilter::with_strategy(PIIFilterStrategy::Remove);

// Full custom configuration
let config = PIIFilterConfig {
    categories: vec![
        PIIFilterCategory::Email,
        PIIFilterCategory::Phone,
        PIIFilterCategory::SSN,
    ],
    strategy: PIIFilterStrategy::Generic,
};
let filter = EmbeddingPIIFilter::with_config(config);

Python

from oxideshield import EmbeddingPIIFilter

filter = EmbeddingPIIFilter()
safe_text = filter.sanitize("Contact john@example.com or call 555-123-4567")
# Returns: "Contact [EMAIL] or call [PHONE]"

Configuration

YAML Configuration

guards:
  input:
    - guard_type: "embedding_pii_filter"
      action: "block"
      options:
        categories:
          - email
          - phone
          - ssn
          - credit_card
          - api_key
          - ip_address
          - date_of_birth
          - url
        strategy: "replace"  # replace | remove | generic

Proxy Gateway Aliases

The guard can be referenced by any of these names:

  • embedding_pii_filter
  • embedding_pii
  • private_embedder

Best Practices

1. Wrap All Embedders

Use PrivateEmbedder to automatically sanitize all text before embedding:

let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);
// All embeds are now PII-safe
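
The wrapper pattern itself is small: sanitize first, then delegate. The Python sketch below illustrates it with a stand-in embedder so it runs without any model. `PrivateEmbedderSketch`, `sanitize_email`, and `fake_embed` are hypothetical names for illustration and are not part of the oxideshield package.

```python
import re
from typing import Callable, List

# Simplified, assumed email pattern.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize_email(text: str) -> str:
    return EMAIL.sub("[EMAIL]", text)


class PrivateEmbedderSketch:
    """Wraps any embedding function so it only ever sees sanitized text."""

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],
        sanitize_fn: Callable[[str], str],
    ) -> None:
        self._embed_fn = embed_fn
        self._sanitize_fn = sanitize_fn

    def embed(self, text: str) -> List[float]:
        return self._embed_fn(self._sanitize_fn(text))


# Stand-in embedder that records what it was asked to embed.
seen: List[str] = []


def fake_embed(text: str) -> List[float]:
    seen.append(text)
    return [float(len(text))]


private = PrivateEmbedderSketch(fake_embed, sanitize_email)
vector = private.embed("Contact john@example.com for info")
print(seen[0])
```

Because callers only hold the wrapper, there is no code path that reaches the underlying embedder with raw text, which is the point of wrapping every embedder rather than sanitizing at each call site.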

2. Use Replace Strategy for Semantic Preservation

The Replace strategy preserves semantic structure (the model sees [EMAIL] as a concept), while Remove can create unnatural text gaps that degrade embedding quality.

3. Combine with RAGInjectionGuard

For RAG pipelines, use EmbeddingPIIFilter at embedding time and RAGInjectionGuard at retrieval time:

guards:
  input:
    - guard_type: "rag_injection"        # Scan retrieved docs
    - guard_type: "embedding_pii_filter" # Strip PII before embedding
    - guard_type: "pattern"              # General prompt injection

4. Validate Credit Cards with Luhn

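EmbeddingPIIFilter applies the Luhn checksum to candidate card numbers, which avoids false positives on long digit strings (such as order or tracking numbers) that are not real credit cards. Because Amex numbers have 15 digits rather than 16, the check is not limited to 16-digit input.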


References

Research Sources

  • Morris et al. (2023), Text Embeddings Reveal (Almost) As Much As Text (arXiv:2310.06816): https://arxiv.org/abs/2310.06816
  • Transferable Embedding Inversion Attack (ACL 2024): https://aclanthology.org/2024.acl-long.230/
  • Eguard: Defending LLM Embeddings (arXiv:2411.05034): https://arxiv.org/abs/2411.05034
  • GDPR Article 4: definition of personal data, which can extend to PII recoverable from embeddings
  • CCPA: California Consumer Privacy Act requirements