PII redaction
Identify personal data across ingestion, indexing, prompts, and logs; default to minimizing what enters model context, and apply reversible or irreversible masking for retained content.
The SKILL should require agents to use a versioned rule library (not brittle one-off regex) when generating ingestion, logging, and RAG pipeline code, and to separate direct identifiers from quasi-identifiers that may enable re-identification when combined.
Scan at the point of entry, before content reaches the indexing or prompt pipeline. RAG chunks may be replaced with tokens or hash references, with raw text kept only in controlled stores. False positives are handled with allowlists for business tokens and with field-level policy.
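One way to picture the field-level policy and the split between direct and quasi-identifiers is a small versioned table. The sketch below is illustrative only: the field names, the FieldPolicy type, the RULESET_VERSION constant, the quasi-identifier cap, and the minimizeRecord helper are all assumptions made for this example, not part of any existing library.

```typescript
// Hypothetical field-level policy table; field names and actions are illustrative.
type Sensitivity = 'direct' | 'quasi' | 'none';
type Action = 'drop' | 'pseudonymize' | 'keep';

interface FieldPolicy {
  sensitivity: Sensitivity;
  action: Action;
}

// Versioning the rule set lets logs and indexes record which rules produced them.
const RULESET_VERSION = '1.0.0-example';

const FIELD_POLICIES: Record<string, FieldPolicy> = {
  email: { sensitivity: 'direct', action: 'drop' },
  phone: { sensitivity: 'direct', action: 'drop' },
  customer_id: { sensitivity: 'direct', action: 'pseudonymize' },
  birthdate: { sensitivity: 'quasi', action: 'keep' },
  zip: { sensitivity: 'quasi', action: 'keep' },
  product: { sensitivity: 'none', action: 'keep' },
};

// Quasi-identifiers are low risk alone but risky together; cap how many may co-occur.
const MAX_QUASI_IDENTIFIERS = 1;

function minimizeRecord(
  record: Record<string, string>,
  pseudonymize: (value: string) => string,
): Record<string, string> {
  const out: Record<string, string> = {};
  let quasiKept = 0;
  for (const [field, value] of Object.entries(record)) {
    // Fields without a policy are dropped by default (fail closed).
    const policy = FIELD_POLICIES[field];
    if (!policy || policy.action === 'drop') continue;
    if (policy.sensitivity === 'quasi') {
      if (quasiKept >= MAX_QUASI_IDENTIFIERS) continue;
      quasiKept++;
    }
    out[field] = policy.action === 'pseudonymize' ? pseudonymize(value) : value;
  }
  return out;
}

const minimized = minimizeRecord(
  {
    email: 'a@example.com',
    customer_id: 'C-1001',
    birthdate: '1990-03-07',
    zip: '100000',
    product: 'router',
    note: 'free text',
  },
  (v) => `pseudo:${v.length}`, // stand-in; a real pseudonym would come from a keyed hash or a vault
);
console.log(RULESET_VERSION, minimized);
```

In this sketch the second quasi-identifier (zip) and the unknown free-text field are dropped, which is one possible reading of "default to minimizing"; a real policy may prefer generalization (for example, year of birth only) over dropping.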
PII handling flow
[ Entry: API / upload / sync / agent tool responses ]
       │
       ▼
┌─────────────┐ Field-level classification; default: no raw PII in vectors or prompts
│ Classify &  │──── Quasi-identifier combos; business IDs → internal pseudonyms
│ minimize    │
└─────────────┘
       │
       ▼
┌─────────────┐ Rule library + NER/dictionary; allowlists and suppression paths
│ Scan & mask │──── Reversible token vs irreversible hash/truncation
└─────────────┘
       │
       ▼
┌─────────────┐ Chunk-level replacement; access to controlled originals requires auth
│ RAG / model │──── Prompt templates must not concatenate unsanitized free text
└─────────────┘
       │
       ▼
┌─────────────┐ Structured log field redaction; trace attribute tiers
│ Logs & trace│──── Sampling and dynamic redaction; high sensitivity: hash or omit
└─────────────┘
       │
       ▼
┌─────────────┐ Retention, purpose, vendor DPA; eval sets
│ Audit & comp│──── Second scan before UI echo; cross-border and data-subject flows
└─────────────┘
Principle: if a field does not need to enter the model or index, keep it out; when it must, use pseudonyms or masks while keeping retrieval and troubleshooting workable (e.g. stable hashes to correlate a session).
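The "stable hash to correlate a session" idea can be sketched with a keyed hash (HMAC-SHA256) from Node's built-in crypto module. The stablePseudonym helper, the prefix, the 16-character truncation, and the demo key are assumptions for illustration; the key would normally come from a secret manager and never appear in logs.

```typescript
import { createHmac } from 'node:crypto';

// Keyed hash: the same input and key always yield the same pseudonym, so log lines
// can be correlated without the raw value. Without the key, an attacker cannot
// simply enumerate candidate values and compare digests.
function stablePseudonym(value: string, key: string, prefix = 'u'): string {
  const normalized = value.trim().toLowerCase();
  const digest = createHmac('sha256', key).update(normalized).digest('hex');
  return `${prefix}_${digest.slice(0, 16)}`; // truncated for readability; the length is a judgment call
}

// Illustrative key only.
const DEMO_KEY = 'demo-key-not-for-production';

const first = stablePseudonym('Alice@Example.com', DEMO_KEY);
const again = stablePseudonym(' alice@example.com ', DEMO_KEY);
const other = stablePseudonym('bob@example.com', DEMO_KEY);
console.log(first === again, first === other);
```

Normalizing before hashing keeps the pseudonym stable across casing and whitespace differences; rotating the key deliberately breaks correlation with older logs.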
Common PII regex rules and redaction implementation
// pii-redactor.ts: common PII regex rule library
export interface PiiRule {
  name: string;
  pattern: RegExp; // must carry the global (g) flag so every match is replaced
  mask: (match: string) => string;
}

export const PII_RULES: PiiRule[] = [
  {
    name: 'mobile',
    // 11-digit mobile number starting with 13-19, not embedded in a longer digit run
    pattern: /(?<!\d)(1[3-9]\d{9})(?!\d)/g,
    mask: (m) => m.slice(0, 3) + '****' + m.slice(7), // 138****5678
  },
  {
    name: 'email',
    pattern: /\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b/g,
    mask: (m) => {
      const at = m.indexOf('@');
      return m.charAt(0) + '***' + m.slice(at); // z***@example.com
    },
  },
  {
    name: 'ipv4',
    // Shape check only: octet ranges are not validated, so dotted version strings also match
    pattern: /\b(\d{1,3}\.){3}(\d{1,3})\b/g,
    mask: (m) => m.replace(/\.\d+$/, '.***'), // 192.168.1.***
  },
];

export function redactText(text: string, rules: PiiRule[] = PII_RULES): string {
  let result = text;
  for (const rule of rules) {
    // Pass only the full match to the mask; capture groups are ignored
    result = result.replace(rule.pattern, (m) => rule.mask(m));
  }
  return result;
}

// Allowlist: order numbers and internal IDs should not be misidentified as PII
const ALLOWLIST_PATTERNS = [
  /\border-[A-Z0-9]{8,}\b/g, // order number
  /\breq-[a-f0-9]{8,}\b/g, // request ID
];

export function redactWithAllowlist(text: string): string {
  // First replace allowlist items with placeholders
  const placeholders: string[] = [];
  let masked = text;
  ALLOWLIST_PATTERNS.forEach((re, i) => {
    masked = masked.replace(re, (m) => {
      placeholders.push(m);
      return `__WL${i}_${placeholders.length - 1}__`;
    });
  });
  // Apply PII redaction
  masked = redactText(masked);
  // Restore allowlist items; leave unknown placeholder-like text untouched
  masked = masked.replace(/__WL\d+_(\d+)__/g, (whole, idx) => placeholders[Number(idx)] ?? whole);
  return masked;
}
Logging redaction middleware (auto-redact request/response logs)
// pii-log-middleware.ts: Express request/response log auto-redaction
import type { NextFunction, Request, Response } from 'express';
import { redactText } from './pii-redactor';

// Lowercase substrings; keys are normalized first so idCard and id_card both match
const PII_LOG_FIELDS = ['email', 'phone', 'mobile', 'idcard', 'password', 'token'];
const MAX_DEPTH = 5;

function redactObject(obj: unknown, depth = 0): unknown {
  if (obj === null || obj === undefined) return obj;
  // Fail closed: anything nested deeper than MAX_DEPTH is replaced, never logged raw
  if (depth > MAX_DEPTH) return '[REDACTED_DEPTH_LIMIT]';
  if (typeof obj === 'string') return redactText(obj);
  if (Array.isArray(obj)) return obj.map((v) => redactObject(v, depth + 1));
  if (typeof obj === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      const normalized = key.toLowerCase().replace(/[^a-z0-9]/g, '');
      if (PII_LOG_FIELDS.some((f) => normalized.includes(f))) {
        result[key] = '[REDACTED]'; // field name match, redact entirely
      } else {
        result[key] = redactObject(value, depth + 1);
      }
    }
    return result;
  }
  return obj; // numbers and booleans pass through unchanged
}

export function piiLogMiddleware(req: Request, res: Response, next: NextFunction): void {
  const requestId = req.headers['x-request-id']; // use request_id to correlate, not full PII
  // Redact request log
  console.log(JSON.stringify({
    level: 'info',
    type: 'request',
    method: req.method,
    path: redactText(req.path), // path segments can carry emails or phone numbers
    query: redactObject(req.query),
    body: redactObject(req.body),
    request_id: requestId,
    timestamp: new Date().toISOString(),
  }));
  // Intercept response log
  const originalJson = res.json.bind(res);
  res.json = (body?: unknown) => {
    console.log(JSON.stringify({
      level: 'info',
      type: 'response',
      status: res.statusCode,
      body: redactObject(body),
      request_id: requestId,
    }));
    return originalJson(body); // the client still receives the unredacted body
  };
  next();
}
PII detection and replacement before sending to LLM
// llm-pii-guard.ts: auto-detect and replace PII before sending to LLM
import OpenAI from 'openai';

interface RedactionMap {
  [placeholder: string]: string; // { '[PHONE_1]': '13812345678' }
}

// Reversible redaction: replace with placeholders, save map for restoration
function redactForLLM(text: string): { redacted: string; map: RedactionMap } {
  const map: RedactionMap = {};
  let counter = 0;
  let redacted = text;
  // ID_CARD runs before PHONE, and both reject adjacent digits, so an
  // 18-character ID number is never split by the 11-digit phone rule
  const rules = [
    { name: 'ID_CARD', pattern: /(?<!\d)\d{17}[\dXx](?!\d)/g },
    { name: 'PHONE', pattern: /(?<!\d)1[3-9]\d{9}(?!\d)/g },
    { name: 'EMAIL', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  ];
  for (const rule of rules) {
    redacted = redacted.replace(rule.pattern, (match) => {
      const placeholder = `[${rule.name}_${++counter}]`;
      map[placeholder] = match;
      return placeholder;
    });
  }
  return { redacted, map };
}

// Usage example
const client = new OpenAI();

async function safeChat(userMessage: string): Promise<string> {
  // Redact before sending to LLM
  const { redacted, map } = redactForLLM(userMessage);
  const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a customer service assistant.' },
      { role: 'user', content: redacted }, // send redacted content
    ],
  });
  const reply = response.choices[0]?.message.content ?? '';
  // Optionally restore original values in the reply (business decision)
  // return restoreFromLLM(reply, map);
  return reply;
}

// Restore placeholders in LLM output (when original values must be shown)
function restoreFromLLM(text: string, map: RedactionMap): string {
  return Object.entries(map).reduce(
    // split/join replaces every occurrence and treats "$" in the original literally
    (t, [placeholder, original]) => t.split(placeholder).join(original),
    text
  );
}

// RAG: redact document chunks before vectorizing.
// vectorStore and secureStorage are placeholders for the project's own clients.
declare const vectorStore: { upsert(doc: { text: string }): Promise<void> };
declare const secureStorage: { store(doc: { original: string }): Promise<void> };

async function indexDocument(content: string) {
  const { redacted } = redactForLLM(content);
  // Store redacted version in vector store; original in controlled storage
  await vectorStore.upsert({ text: redacted });
  await secureStorage.store({ original: content });
}
- Agent tools: tool parameters and return values go through the same redaction configuration.
- Evaluation: adversarial samples and multilingual PII coverage; record false positive and missed detection rates.
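Recording false positive and missed detection rates can be as simple as scoring a detector against a small labeled set. The sketch below assumes a hypothetical Sample shape and scoreDetector helper, and uses a single mobile-number regex as the detector under test; the sample texts are invented for illustration.

```typescript
// Labeled sample: text plus the exact PII substrings a detector is expected to find.
interface Sample {
  text: string;
  expected: string[];
}

interface Scores {
  truePositives: number;
  falsePositives: number;
  missed: number;
  precision: number;
  recall: number;
}

// detect is any function that returns the substrings it considers PII.
function scoreDetector(samples: Sample[], detect: (text: string) => string[]): Scores {
  let truePositives = 0;
  let falsePositives = 0;
  let missed = 0;
  for (const { text, expected } of samples) {
    const found = detect(text);
    truePositives += found.filter((f) => expected.includes(f)).length;
    falsePositives += found.filter((f) => !expected.includes(f)).length;
    missed += expected.filter((e) => !found.includes(e)).length;
  }
  const reported = truePositives + falsePositives;
  const actual = truePositives + missed;
  return {
    truePositives,
    falsePositives,
    missed,
    precision: reported === 0 ? 1 : truePositives / reported,
    recall: actual === 0 ? 1 : truePositives / actual,
  };
}

// Detector under test: ASCII-only mobile regex.
const detectMobile = (text: string): string[] => text.match(/(?<!\d)1[3-9]\d{9}(?!\d)/g) ?? [];

const samples: Sample[] = [
  { text: 'Call 13812345678 today', expected: ['13812345678'] },
  { text: 'Tracking no. 13900001111', expected: [] }, // looks like a phone, is not one
  { text: 'Tel: １３８１２３４５６７８', expected: ['１３８１２３４５６７８'] }, // full-width digits
  { text: 'Tel: 138 1234 5678', expected: ['138 1234 5678'] }, // spaced digits
];

const scores = scoreDetector(samples, detectMobile);
console.log(scores);
```

The two adversarial samples (full-width and spaced digits) are missed by the plain regex, which is the kind of gap such a harness is meant to surface before it shows up in production logs.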
Masking vs encryption vs tokenization — selection criteria
| Approach | Reversible | When to use | When not to use |
|---|---|---|---|
| Masking (e.g. 138****5678) | No | Logs, external reports, customer service UI where only partial visibility is needed | Business flows that need to recover the original value (e.g. sending an SMS) |
| Encryption (AES-256-GCM) | Yes (with key) | Need to decrypt and use the original value per business logic (e.g. sending notifications) | Field-level search required (encrypted values cannot be queried directly) |
| Tokenization (e.g. tok_abc123) | Yes (with vault) | Payment card PAN storage (PCI DSS); cross-system reference to the same user | Range queries or analytics against the original value are needed |
| Hashing (SHA-256 + salt) | No | Deduplication checks (existence), session correlation (stable hash) | Display of original value needed; low-entropy inputs such as phone numbers, which can be recovered by enumerating the value space even when salted |
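To make the encryption row concrete, here is a minimal field-level AES-256-GCM round trip using Node's built-in crypto module. The encryptField and decryptField helpers and the dot-separated packing format are assumptions for this sketch; key management (storage, rotation, access control) is out of scope and the demo key is generated in memory.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Encrypts one field value; the output packs nonce, auth tag, and ciphertext as base64 parts.
function encryptField(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, must not repeat under the same key
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((b) => b.toString('base64')).join('.');
}

// Decrypts a packed value; final() throws if the ciphertext or tag was modified.
function decryptField(packed: string, key: Buffer): string {
  const [iv, tag, ciphertext] = packed.split('.').map((p) => Buffer.from(p, 'base64'));
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}

// Illustrative 256-bit key; a real key would come from a KMS or secret manager.
const demoKey = randomBytes(32);

const enc1 = encryptField('13812345678', demoKey);
const enc2 = encryptField('13812345678', demoKey);
console.log(decryptField(enc1, demoKey) === '13812345678', enc1 === enc2);
```

Because each call uses a fresh nonce, two encryptions of the same phone number differ, which is why encrypted columns cannot be matched or searched directly and why a separate keyed hash or token is often stored alongside for lookups.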
- Cross-border: transfer mechanisms and regional deployment options are placeholders in SKILL output, to be confirmed by legal.
- Do not treat “redaction applied” as a legal conclusion in automated output; significant policy decisions require human review.
Client-side redaction demo (illustrative)
Below is a browser-only illustration using regex replacement to show masking shapes. Production must use reviewed rule libraries and dedicated engines; do not copy this page’s script verbatim.
Click “Apply redaction” again after edits. The copy is for draft docs only and is not proof of compliance for real personal data.
---
name: pii-redaction
description: PII identification, redaction and masking, log middleware, LLM pipeline PII protection, comparison of handling strategies
---
# Steps
1. Identify PII types: direct (email, phone, ID card), quasi-identifiers (birthdate + zip), sensitive (health, financial)
2. Build regex rule library: covering email, mobile, IP address, ID card number
3. Allowlist mechanism: order numbers and internal IDs to prevent false positives
4. Logging middleware: auto-redact PII from request/response logs based on field names
5. LLM/RAG pipeline: replace PII with reversible placeholders before sending; original in controlled storage
6. Vector database: store redacted versions only; never send raw PII to embedding API
7. Agent tool call parameters: apply same redaction pipeline before logging tool args
8. Data minimization: fields that don't need to enter models/indices should not enter
9. Redaction vs encryption vs tokenization vs hashing: choose based on reversibility needs
10. Evaluate coverage: adversarial test samples + multilingual PII (names in different scripts)
11. Allowlist governance: version-controlled, reviewed; entries with rationale and expiry
12. Cross-border transfer: redact or pseudonymize before transferring across jurisdiction boundaries
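Step 11 (allowlist governance) could be sketched as data rather than bare regexes, so each exemption carries its rationale, owner, and expiry. The AllowlistEntry shape, the entry values, and the activeEntries helper are hypothetical and only illustrate the idea.

```typescript
// Hypothetical allowlist entry: every exemption records why it exists and when it lapses.
interface AllowlistEntry {
  id: string;
  pattern: RegExp;
  rationale: string;
  owner: string;
  expires: string; // ISO date, YYYY-MM-DD
}

// Illustrative entries; in practice this file would live in version control and go through review.
const ALLOWLIST: AllowlistEntry[] = [
  {
    id: 'order-number',
    pattern: /\border-[A-Z0-9]{8,}\b/g,
    rationale: 'Order numbers are business IDs, not personal data',
    owner: 'example-owner-a',
    expires: '2030-01-01',
  },
  {
    id: 'legacy-request-id',
    pattern: /\breq-[a-f0-9]{8,}\b/g,
    rationale: 'Request IDs from a retired gateway',
    owner: 'example-owner-b',
    expires: '2020-01-01',
  },
];

// Expired entries stop exempting text, so stale exceptions fail closed.
function activeEntries(entries: AllowlistEntry[], today: Date): AllowlistEntry[] {
  return entries.filter((e) => Date.parse(`${e.expires}T23:59:59Z`) >= today.getTime());
}

const active = activeEntries(ALLOWLIST, new Date('2025-06-01T00:00:00Z'));
console.log(active.map((e) => e.id));
```

Passing the current date in as a parameter keeps the expiry check deterministic in tests and makes it easy to preview which entries will lapse by a given date.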
# Anti-patterns
- Do NOT pass raw PII directly to LLM prompts or vector databases
- Do NOT rely only on regex without context-aware checks (high false-positive rate)
- Do NOT use redaction as a substitute for access control
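As one example of a context-aware check layered on top of a regex, an 18-character resident ID candidate can be validated against its check character (the ISO 7064 MOD 11-2 scheme) before it is treated as PII. The helper names below are assumptions for this sketch, and the ID used in the demo is a sample value that satisfies the checksum.

```typescript
// Position weights and check characters for the 18-character ID checksum (ISO 7064 MOD 11-2).
const WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2];
const CHECK_CHARS = '10X98765432';

// Returns true only when the 18th character matches the checksum of the first 17 digits.
function hasValidIdChecksum(candidate: string): boolean {
  if (!/^\d{17}[\dXx]$/.test(candidate)) return false;
  const sum = WEIGHTS.reduce((acc, w, i) => acc + w * Number(candidate[i]), 0);
  return CHECK_CHARS[sum % 11] === candidate[17].toUpperCase();
}

// Regex proposes candidates; the checksum filters out lookalike digit runs.
function findIdCards(text: string): string[] {
  const candidates = text.match(/(?<!\d)\d{17}[\dXx](?!\d)/g) ?? [];
  return candidates.filter(hasValidIdChecksum);
}

const hits = findIdCards('ID 11010519491231002X, tracking 123456789012345678');
console.log(hits);
```

A checksum filter lowers false positives for structured identifiers, but it does nothing for free-form PII such as names or addresses, where NER or dictionary checks are the usual complement.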