EmbeddingPIIFilter¶

Strips PII from text before embedding generation, preventing privacy leakage through embedding inversion attacks. Supports 8 PII categories with 3 redaction strategies and Luhn-validated credit card detection.

Professional License Required

EmbeddingPIIFilter requires a Professional or Enterprise license. See Licensing for details.

Executive Summary¶

The Problem¶

Embedding inversion attacks can reconstruct original text from embeddings. If that text contained PII — emails, SSNs, credit cards, API keys — the PII leaks through the embedding vector. Traditional PII guards operate on text; EmbeddingPIIFilter operates at the embedding boundary, sanitizing text before it reaches the embedder.

Threat Landscape¶

Attack Vector	Example	Severity
Embedding inversion	Reconstruct `john@example.com` from 384-dim vector	Critical
Training data extraction	PII memorized in embedding model weights	High
Vector database leakage	PII stored in plaintext alongside embeddings	High
Cross-tenant inference	Shared embedding space reveals PII across tenants	Critical

Industry Context¶

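Morris et al. (2023) showed that a multi-step inversion method can exactly recover 92% of 32-token text inputs from their embeddings. GDPR Article 4(1) defines personal data as any information relating to an identified or identifiable natural person, so embeddings from which PII can be reconstructed may themselves qualify as personal data, making unfiltered embeddings a compliance risk.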

Sources: Morris et al. (2023), ACL 2024 Transferable Embedding Inversion, GDPR Article 4, CCPA requirements

PII Categories¶

1. Email Addresses¶

john.doe@example.com → [EMAIL]

2. Phone Numbers (US + International)¶

(555) 123-4567 → [PHONE]
+1-555-123-4567 → [PHONE]

3. Social Security Numbers¶

123-45-6789 → [SSN]

4. Credit Card Numbers (Luhn-validated)¶

Supports Visa, Mastercard, Amex, and Discover. Only flags numbers that pass the Luhn checksum algorithm:

4532015112830366 → [CREDIT_CARD]
1234567890123456 → (not flagged — fails Luhn)
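To make the Luhn check concrete, here is a minimal Python sketch of the checksum itself. The function name `luhn_valid` is illustrative and not part of the library's API; the filter's actual implementation is not shown in this document and may differ (for example, in how it handles card-brand prefixes or lengths).

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces or dashes.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit (the first doubled
    # digit is the one immediately left of the check digit).
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4532015112830366"))  # True
print(luhn_valid("1234567890123456"))  # False
```

Passing the checksum only shows that a number is well-formed; it does not show that the card was ever issued.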

5. API Keys & Tokens¶

Detects AWS Access Keys, GitHub PATs, Stripe keys, Bearer tokens, and generic API key patterns:

AKIAIOSFODNN7EXAMPLE → [API_KEY]
ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx → [API_KEY]
sk_test_1234567890abcdefghijklmnop → [API_KEY]
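As a rough illustration of prefix-based key detection, the sketch below matches the three example formats above with regular expressions. The patterns, the `API_KEY_PATTERNS` list, and the `redact_api_keys` helper are assumptions made for this example; the filter's real rules are not documented here and are likely broader (Bearer tokens and generic keys are not covered).

```python
import re

# Illustrative patterns only, based on the commonly documented key shapes.
API_KEY_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                    # AWS access key ID
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                 # GitHub classic PAT
    re.compile(r"\bsk_(?:test|live)_[A-Za-z0-9]{24,}\b"),   # Stripe secret key
]


def redact_api_keys(text: str, placeholder: str = "[API_KEY]") -> str:
    """Replace every match of any pattern with the placeholder."""
    for pattern in API_KEY_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact_api_keys("key=AKIAIOSFODNN7EXAMPLE"))  # key=[API_KEY]
```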

6. IP Addresses (IPv4 + IPv6)¶

192.168.1.100 → [IP_ADDRESS]
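One way to detect both IPv4 and IPv6 without hand-written patterns is to validate whitespace-delimited tokens with Python's standard `ipaddress` module. This is a sketch under that assumption, not the filter's documented method; `redact_ips` is a hypothetical helper, and it only finds addresses that stand alone as a token (optionally wrapped in simple punctuation).

```python
import ipaddress
import re


def redact_ips(text: str, placeholder: str = "[IP_ADDRESS]") -> str:
    """Replace whitespace-delimited tokens that parse as IPv4 or IPv6."""

    def replace(match):
        token = match.group(0)
        # Drop surrounding punctuation such as a trailing comma or period.
        core = token.strip(".,;()[]")
        try:
            ipaddress.ip_address(core)
        except ValueError:
            return token  # not an IP address, leave the token untouched
        return token.replace(core, placeholder, 1)

    return re.sub(r"\S+", replace, text)


print(redact_ips("Host 192.168.1.100 is up"))  # Host [IP_ADDRESS] is up
```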

7. Dates of Birth (US, ISO, EU formats)¶

01/15/1990 → [DOB]
1990-01-15 → [DOB]

8. URLs¶

https://example.com/profile?id=123 → [URL]

Redaction Strategies¶

Strategy	Behavior	Example
Replace (default)	Category-specific placeholder	`Email [EMAIL] today`
Remove	PII removed entirely	`Email today`
Generic	Single generic placeholder	`Email [REDACTED] today`
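The three strategies in the table can be sketched in a few lines of Python. The `Strategy` enum, the simplified `EMAIL` pattern, and `redact_emails` are illustrative names rather than the library's API, and collapsing the leftover double space in the Remove case is an assumption inferred from the table's example.

```python
import re
from enum import Enum


class Strategy(Enum):
    REPLACE = "replace"
    REMOVE = "remove"
    GENERIC = "generic"


# Deliberately simplified email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, strategy: Strategy = Strategy.REPLACE) -> str:
    if strategy is Strategy.REPLACE:
        return EMAIL.sub("[EMAIL]", text)      # category-specific placeholder
    if strategy is Strategy.GENERIC:
        return EMAIL.sub("[REDACTED]", text)   # one placeholder for all categories
    # REMOVE: delete the match, then collapse the double space it leaves behind.
    return re.sub(r" {2,}", " ", EMAIL.sub("", text)).strip()


sample = "Email john@example.com today"
print(redact_emails(sample, Strategy.REPLACE))  # Email [EMAIL] today
print(redact_emails(sample, Strategy.REMOVE))   # Email today
print(redact_emails(sample, Strategy.GENERIC))  # Email [REDACTED] today
```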

Developer Guide¶

Basic Usage¶

Rust

use oxide_embeddings::privacy::{EmbeddingPIIFilter, PrivateEmbedder};
use oxide_embeddings::MiniLmEmbedder;

// Standalone filter
let filter = EmbeddingPIIFilter::new();
assert!(filter.contains_pii("Email john@example.com"));

let sanitized = filter.sanitize("Email john@example.com");
assert_eq!(sanitized, "Email [EMAIL]");

// Wrap any embedder with privacy filtering
let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);

// PII is stripped before embedding generation
let emb = private.embed("Contact john@example.com for info").await?;
// The embedding is generated from "Contact [EMAIL] for info"

Python

from oxideshield import embedding_pii_filter, EmbeddingPIIFilter

# Using convenience function
filter = embedding_pii_filter()

# Check for PII
has_pii = filter.contains_pii("Email john@example.com")

# Sanitize text
safe_text = filter.sanitize("SSN: 123-45-6789")
# Returns: "SSN: [SSN]"

Custom Configuration¶

Rust

use oxide_embeddings::privacy::{
    EmbeddingPIIFilter, PIIFilterCategory, PIIFilterStrategy, PIIFilterConfig,
};

// Filter specific categories only
let filter = EmbeddingPIIFilter::with_categories(&[
    PIIFilterCategory::Email,
    PIIFilterCategory::SSN,
    PIIFilterCategory::CreditCard,
]);

// Use a different redaction strategy
let filter = EmbeddingPIIFilter::with_strategy(PIIFilterStrategy::Remove);

// Full custom configuration
let config = PIIFilterConfig {
    categories: vec![
        PIIFilterCategory::Email,
        PIIFilterCategory::Phone,
        PIIFilterCategory::SSN,
    ],
    strategy: PIIFilterStrategy::Generic,
};
let filter = EmbeddingPIIFilter::with_config(config);

Python

from oxideshield import EmbeddingPIIFilter

filter = EmbeddingPIIFilter()
safe_text = filter.sanitize("Contact john@example.com or call 555-123-4567")
# Returns: "Contact [EMAIL] or call [PHONE]"

Configuration¶

YAML Configuration¶

guards:
  input:
    - guard_type: "embedding_pii_filter"
      action: "block"
      options:
        categories:
          - email
          - phone
          - ssn
          - credit_card
          - api_key
          - ip_address
          - date_of_birth
          - url
        strategy: "replace"  # replace | remove | generic

Proxy Gateway Aliases¶

The guard can be referenced by any of these names:

embedding_pii_filter
embedding_pii
private_embedder

Best Practices¶

1. Wrap All Embedders¶

Use PrivateEmbedder to automatically sanitize all text before embedding:

let embedder = MiniLmEmbedder::new().await?;
let private = PrivateEmbedder::new(embedder);
// All embeds are now PII-safe
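The wrapper idea is independent of any particular embedder: sanitize first, then delegate. The Python sketch below illustrates that pattern with stand-in pieces. `SanitizingEmbedder`, `sanitize`, and `recording_embedder` are hypothetical names for this example and do not reflect the actual `PrivateEmbedder` implementation.

```python
import re
from typing import Callable, List

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize(text: str) -> str:
    """Stand-in for the real filter: replaces email addresses only."""
    return EMAIL.sub("[EMAIL]", text)


class SanitizingEmbedder:
    """Hypothetical wrapper: sanitizes text, then delegates to the embedder."""

    def __init__(self, embed_fn: Callable[[str], List[float]],
                 sanitize_fn: Callable[[str], str] = sanitize) -> None:
        self._embed_fn = embed_fn
        self._sanitize_fn = sanitize_fn

    def embed(self, text: str) -> List[float]:
        # The wrapped embedder only ever receives sanitized text.
        return self._embed_fn(self._sanitize_fn(text))


seen: List[str] = []


def recording_embedder(text: str) -> List[float]:
    """Toy embedder that records its input and returns a dummy vector."""
    seen.append(text)
    return [float(len(text))]


private = SanitizingEmbedder(recording_embedder)
private.embed("Contact john@example.com for info")
print(seen)  # ['Contact [EMAIL] for info']
```

Because callers hold only the wrapper, there is no code path that passes raw text to the underlying embedder by accident.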

2. Use Replace Strategy for Semantic Preservation¶

The Replace strategy preserves sentence structure: the placeholder [EMAIL] keeps the text grammatical and tells the model that an email address was present. Remove can leave unnatural gaps in the text that degrade embedding quality.

3. Combine with RAGInjectionGuard¶

For RAG pipelines, use EmbeddingPIIFilter at embedding time and RAGInjectionGuard at retrieval time:

guards:
  input:
    - guard_type: "rag_injection"        # Scan retrieved docs
    - guard_type: "embedding_pii_filter" # Strip PII from embeddings
    - guard_type: "pattern"              # General prompt injection

4. Validate Credit Cards with Luhn¶

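EmbeddingPIIFilter uses the Luhn algorithm to reduce false positives on card-length digit sequences, such as order or tracking numbers, that are not valid card numbers. A number that passes Luhn is well-formed but not necessarily an issued card.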

References¶

Research Sources¶

Text Embeddings Reveal (Almost) As Much As Text (arXiv:2310.06816), Morris et al., 2023
https://arxiv.org/abs/2310.06816
Transferable Embedding Inversion Attack (ACL 2024)
https://aclanthology.org/2024.acl-long.230/
Eguard: Defending LLM Embeddings (arXiv:2411.05034)
https://arxiv.org/abs/2411.05034
GDPR Article 4(1): definition of personal data, which can extend to embeddings from which PII is recoverable
CCPA: California Consumer Privacy Act requirements

Related Guards¶

RAGInjectionGuard - RAG document injection detection
PIIGuard - General PII detection in text
PatternGuard - General prompt injection detection