EmbeddingPIIFilter¶
Strips PII from text before embedding generation, preventing privacy leakage through embedding inversion attacks. Supports 8 PII categories with 3 redaction strategies and Luhn-validated credit card detection.
Professional License Required
EmbeddingPIIFilter requires a Professional or Enterprise license. See Licensing for details.
Executive Summary¶
The Problem¶
Embedding inversion attacks can reconstruct original text from embeddings. If that text contained PII — emails, SSNs, credit cards, API keys — the PII leaks through the embedding vector. Traditional PII guards operate on text; EmbeddingPIIFilter operates at the embedding boundary, sanitizing text before it reaches the embedder.
Threat Landscape¶
| Attack Vector | Example | Severity |
|---|---|---|
| Embedding inversion | Reconstruct john@example.com from 384-dim vector | Critical |
| Training data extraction | PII memorized in embedding model weights | High |
| Vector database leakage | PII stored in plaintext alongside embeddings | High |
| Cross-tenant inference | Shared embedding space reveals PII across tenants | Critical |
Industry Context¶
Morris et al. (2023) showed that an embedding inversion attack can exactly recover 92% of 32-token text inputs from their embeddings. Under the definition of personal data in GDPR Article 4(1), embeddings from which PII can be recovered may themselves count as personal data, which makes unfiltered embeddings a compliance risk.
Sources: Morris et al. (2023), ACL 2024 Transferable Embedding Inversion, GDPR Article 4, CCPA requirements
PII Categories¶
1. Email Addresses¶
2. Phone Numbers (US + International)¶
3. Social Security Numbers (US)¶
4. Credit Card Numbers (Luhn-validated)¶
Supports Visa, Mastercard, Amex, and Discover. Only numbers that pass the Luhn checksum are flagged.
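The Luhn check mentioned above can be sketched in a few lines. This is a minimal standalone illustration, not the filter's internal code: the function name `luhn_valid`, the accepted separators, and the 13 to 19 digit length bounds are assumptions made for the example.

```rust
/// Returns true if `candidate` passes the Luhn checksum.
/// Spaces and hyphens are skipped; any other non-digit character rejects the input.
fn luhn_valid(candidate: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in candidate.chars() {
        match c {
            ' ' | '-' => continue,
            _ => match c.to_digit(10) {
                Some(d) => digits.push(d),
                None => return false,
            },
        }
    }
    // Assumed length bounds for this sketch: 13 to 19 digits.
    if digits.len() < 13 || digits.len() > 19 {
        return false;
    }
    // Walk from the rightmost digit and double every second digit.
    // A doubled value above 9 has 9 subtracted (same as summing its digits).
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

With this check, a well-known test number such as `4111111111111111` is accepted, while an arbitrary 16-digit string such as `1234567890123456` is rejected.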
5. API Keys & Tokens¶
Detects AWS Access Keys, GitHub PATs, Stripe keys, Bearer tokens, and generic API key patterns:
AKIAIOSFODNN7EXAMPLE → [API_KEY]
ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx → [API_KEY]
sk_test_1234567890abcdefghijklmnop → [API_KEY]
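To make the examples above concrete, here is a rough prefix-based sketch covering three of the listed key families. It is not the filter's actual rule set: `KeyKind`, `classify_key`, and `redact_api_keys` are hypothetical names, the shapes follow the publicly documented AWS and GitHub formats, and the minimum Stripe body length of 16 is an assumption.

```rust
/// Key families recognised by this sketch (names are illustrative).
#[derive(Debug, PartialEq)]
enum KeyKind {
    AwsAccessKey,
    GitHubPat,
    StripeKey,
}

/// True if `s` is non-empty and consists only of ASCII letters and digits.
fn is_alnum(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric())
}

/// Classifies a single whitespace-free token by its prefix and shape.
fn classify_key(token: &str) -> Option<KeyKind> {
    // AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    if let Some(rest) = token.strip_prefix("AKIA") {
        if rest.len() == 16
            && rest.chars().all(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
        {
            return Some(KeyKind::AwsAccessKey);
        }
    }
    // GitHub classic personal access tokens: "ghp_" followed by 36 alphanumerics.
    if let Some(rest) = token.strip_prefix("ghp_") {
        if rest.len() == 36 && is_alnum(rest) {
            return Some(KeyKind::GitHubPat);
        }
    }
    // Stripe secret keys: "sk_test_" or "sk_live_" plus an alphanumeric body.
    // The minimum body length of 16 is an assumption made for this sketch.
    for prefix in ["sk_test_", "sk_live_"] {
        if let Some(rest) = token.strip_prefix(prefix) {
            if rest.len() >= 16 && is_alnum(rest) {
                return Some(KeyKind::StripeKey);
            }
        }
    }
    None
}

/// Replaces every recognised token with the `[API_KEY]` placeholder.
/// Tokens are split on whitespace, so runs of whitespace collapse to one space.
fn redact_api_keys(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| if classify_key(tok).is_some() { "[API_KEY]" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because the sketch splits on whitespace, a key followed directly by punctuation would be missed. A production detector would more likely scan with anchored patterns.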
6. IP Addresses (IPv4 + IPv6)¶
7. Dates of Birth (US, ISO, EU formats)¶
8. URLs¶
Redaction Strategies¶
| Strategy | Behavior | Example |
|---|---|---|
| Replace (default) | Category-specific placeholder | Email [EMAIL] today |
| Remove | PII removed entirely | Email today |
| Generic | Single generic placeholder | Email [REDACTED] today |
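The three strategies in the table can be illustrated with a small email-only sketch. It is not how EmbeddingPIIFilter is implemented: `Strategy`, `looks_like_email`, and `redact_emails` are hypothetical names, and the email check is deliberately crude.

```rust
/// Redaction strategies mirroring the table above (type names are illustrative).
#[derive(Clone, Copy)]
enum Strategy {
    Replace,
    Remove,
    Generic,
}

/// Very rough email check: a single '@', a non-empty local part, and a dotted domain.
/// A real detector would use a proper pattern; this is only enough for the demo.
fn looks_like_email(token: &str) -> bool {
    match token.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && !domain.contains('@')
                && domain.contains('.')
                && !domain.starts_with('.')
                && !domain.ends_with('.')
        }
        None => false,
    }
}

/// Applies the chosen strategy to every email-like token in `text`.
/// Tokens are split on whitespace, so runs of whitespace collapse to one space.
fn redact_emails(text: &str, strategy: Strategy) -> String {
    text.split_whitespace()
        .filter_map(|tok| {
            if !looks_like_email(tok) {
                return Some(tok);
            }
            match strategy {
                Strategy::Replace => Some("[EMAIL]"),
                Strategy::Remove => None,
                Strategy::Generic => Some("[REDACTED]"),
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

For the input `Email john@example.com today`, the three strategies produce the three example outputs listed in the table.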
Developer Guide¶
Basic Usage¶
use oxide_embeddings::privacy::{EmbeddingPIIFilter, PrivateEmbedder};
use oxide_embeddings::MiniLmEmbedder;

// Standalone filter
let filter = EmbeddingPIIFilter::new();
assert!(filter.contains_pii("Email john@example.com"));
let sanitized = filter.sanitize("Email john@example.com");
assert_eq!(sanitized, "Email [EMAIL]");

// Wrap any embedder with privacy filtering.
// The `.await?` calls below must run inside an async function that returns a Result.
let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);

// PII is stripped before embedding generation
let emb = private.embed("Contact john@example.com for info").await?;
// The embedding is generated from "Contact [EMAIL] for info"
Custom Configuration¶
use oxide_embeddings::privacy::{
    EmbeddingPIIFilter, PIIFilterCategory, PIIFilterStrategy, PIIFilterConfig,
};

// Filter specific categories only
let filter = EmbeddingPIIFilter::with_categories(&[
    PIIFilterCategory::Email,
    PIIFilterCategory::SSN,
    PIIFilterCategory::CreditCard,
]);

// Use a different redaction strategy
let filter = EmbeddingPIIFilter::with_strategy(PIIFilterStrategy::Remove);

// Full custom configuration
let config = PIIFilterConfig {
    categories: vec![
        PIIFilterCategory::Email,
        PIIFilterCategory::Phone,
        PIIFilterCategory::SSN,
    ],
    strategy: PIIFilterStrategy::Generic,
};
let filter = EmbeddingPIIFilter::with_config(config);
Configuration¶
YAML Configuration¶
guards:
  input:
    - guard_type: "embedding_pii_filter"
      action: "block"
      options:
        categories:
          - email
          - phone
          - ssn
          - credit_card
          - api_key
          - ip_address
          - date_of_birth
          - url
        strategy: "replace"  # replace | remove | generic
Proxy Gateway Aliases¶
The guard can be referenced by any of these names:
- embedding_pii_filter
- embedding_pii
- private_embedder
Best Practices¶
1. Wrap All Embedders¶
Use PrivateEmbedder to automatically sanitize all text before embedding:
let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);
// All embeds are now PII-safe
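The wrapping pattern can be shown with a self-contained decorator sketch. Everything in it is a hypothetical stand-in for illustration: the synchronous `Embed` trait, the `Sanitizing` wrapper, the `Recording` test double, and the `mask_at_tokens` sanitizer are not types from `oxide_embeddings`, whose real API is asynchronous.

```rust
use std::cell::RefCell;

/// Hypothetical synchronous embedder interface used only for this sketch.
trait Embed {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Decorator that sanitizes text before delegating to the inner embedder.
struct Sanitizing<E: Embed> {
    inner: E,
    sanitize: fn(&str) -> String,
}

impl<E: Embed> Embed for Sanitizing<E> {
    fn embed(&self, text: &str) -> Vec<f32> {
        // The inner embedder only ever receives the sanitized text.
        let clean = (self.sanitize)(text);
        self.inner.embed(&clean)
    }
}

/// Test double that records the text it was asked to embed.
struct Recording {
    seen: RefCell<Vec<String>>,
}

impl Embed for Recording {
    fn embed(&self, text: &str) -> Vec<f32> {
        self.seen.borrow_mut().push(text.to_string());
        vec![text.len() as f32] // placeholder vector, not a real embedding
    }
}

/// Stand-in sanitizer: masks any whitespace-separated token containing '@'.
fn mask_at_tokens(text: &str) -> String {
    text.split_whitespace()
        .map(|t| if t.contains('@') { "[EMAIL]" } else { t })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because sanitization lives in the wrapper, callers cannot bypass it by accident: any code holding only the wrapped embedder has no path to the raw one.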
2. Use Replace Strategy for Semantic Preservation¶
The Replace strategy preserves semantic structure (the model sees [EMAIL] as a concept), while Remove can create unnatural text gaps that degrade embedding quality.
3. Combine with RAGInjectionGuard¶
For RAG pipelines, use EmbeddingPIIFilter at embedding time and RAGInjectionGuard at retrieval time:
guards:
  input:
    - guard_type: "rag_injection"         # Scan retrieved docs
    - guard_type: "embedding_pii_filter"  # Strip PII from embeddings
    - guard_type: "pattern"               # General prompt injection
4. Validate Credit Cards with Luhn¶
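EmbeddingPIIFilter uses the Luhn algorithm to reduce false positives on digit sequences that have the length of a card number but are not real credit cards, such as order IDs or tracking numbers. The check lowers the false-positive rate but does not eliminate it, since some arbitrary digit strings pass the checksum by chance.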
References¶
Research Sources¶
- Text Embeddings Reveal (Almost) As Much As Text (arXiv:2310.06816), Morris et al., 2023: https://arxiv.org/abs/2310.06816
- Transferable Embedding Inversion Attack (ACL 2024): https://aclanthology.org/2024.acl-long.230/
- Eguard: Defending LLM Embeddings (arXiv:2411.05034): https://arxiv.org/abs/2411.05034
- GDPR Article 4(1): definition of personal data, which can extend to embeddings from which PII is recoverable
- CCPA — California Consumer Privacy Act requirements
Related Guards¶
- RAGInjectionGuard - RAG document injection detection
- PIIGuard - General PII detection in text
- PatternGuard - General prompt injection detection