PII redaction

Identify personal data across ingestion, indexing, prompts, and logs; default to minimizing what enters model context, and apply reversible or irreversible masking for retained content.

The SKILL should require agents to use a versioned rule library (not brittle one-off regex) when generating ingestion, logging, and RAG pipeline code, and to separate direct identifiers from quasi-identifiers that may enable re-identification when combined.
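One possible shape for such a rule library is sketched below. It assumes a semantic version string on the library and a two-way split into direct and quasi identifiers; the names `RuleLibrary`, `LIBRARY_V1`, and `scan`, and the "two or more quasi-identifiers" threshold, are illustrative assumptions rather than anything the skill prescribes.

```typescript
// Hypothetical versioned rule library; every name here is illustrative.
type IdentifierClass = 'direct' | 'quasi';

interface VersionedRule {
  id: string;
  cls: IdentifierClass;
  pattern: RegExp; // must carry the g flag so match() returns all hits
}

interface RuleLibrary {
  version: string;
  rules: VersionedRule[];
}

const LIBRARY_V1: RuleLibrary = {
  version: '1.0.0',
  rules: [
    { id: 'email', cls: 'direct', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
    { id: 'birthdate', cls: 'quasi', pattern: /\b\d{4}-\d{2}-\d{2}\b/g },
    { id: 'zip', cls: 'quasi', pattern: /\b\d{5}\b/g },
  ],
};

interface ScanResult {
  libraryVersion: string;  // recorded so findings can be traced to a rule set
  direct: string[];        // ids of direct-identifier rules that matched
  quasi: string[];         // ids of quasi-identifier rules that matched
  quasiComboRisk: boolean; // true when two or more quasi-identifiers co-occur
}

function scan(text: string, lib: RuleLibrary = LIBRARY_V1): ScanResult {
  const direct: string[] = [];
  const quasi: string[] = [];
  for (const rule of lib.rules) {
    // With a global regex, match() returns every hit or null when there is none.
    if (text.match(rule.pattern) !== null) {
      (rule.cls === 'direct' ? direct : quasi).push(rule.id);
    }
  }
  return { libraryVersion: lib.version, direct, quasi, quasiComboRisk: quasi.length >= 2 };
}
```

Recording `libraryVersion` with each result makes it possible to re-scan stored content when the rules change.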

Scan content before it enters the pipeline. In RAG chunks, PII may be replaced with tokens or hash references, with the raw text kept only in controlled stores. Handle false positives with allowlists of business tokens and field-level policy.
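A field-level policy can be as small as a lookup from field name to action. The sketch below assumes three actions (`keep`, `mask`, `drop`) and drops any field the policy does not list; the field names and the `applyFieldPolicy` helper are hypothetical.

```typescript
// Hypothetical field-level policy; field names and actions are illustrative.
type FieldAction = 'keep' | 'mask' | 'drop';

const FIELD_POLICY: Record<string, FieldAction> = {
  order_id: 'keep',    // business token, allowlisted
  email: 'mask',       // retained only in masked form
  national_id: 'drop', // never enters the pipeline
};

function applyFieldPolicy(
  record: Record<string, string>,
  policy: Record<string, FieldAction> = FIELD_POLICY,
  fallback: FieldAction = 'drop', // unknown fields are dropped by default
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [field, value] of Object.entries(record)) {
    const action = policy[field] ?? fallback;
    if (action === 'keep') out[field] = value;
    else if (action === 'mask') out[field] = '[MASKED]';
    // 'drop': the field is omitted from the output entirely
  }
  return out;
}
```

Defaulting unknown fields to `drop` means a newly added column stays out of the index until someone classifies it.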

PII handling flow

  [ Entry: API / upload / sync / agent tool responses ]
        │
        ▼
  ┌─────────────┐    Field-level classification; default: no raw PII in vectors or prompts
  │Classify &   │──── Quasi-identifier combos; business IDs → internal pseudonyms
  │ minimize    │
  └─────────────┘
        │
        ▼
  ┌─────────────┐    Rule library + NER/dictionary; allowlists and suppression paths
  │Scan & mask  │──── Reversible token vs irreversible hash/truncation
  └─────────────┘
        │
        ▼
  ┌─────────────┐    Chunk-level replacement; access to controlled originals requires auth
  │RAG / model  │──── Prompt templates must not concatenate unsanitized free text
  └─────────────┘
        │
        ▼
  ┌─────────────┐    Structured log field redaction; trace attribute tiers
  │Logs & trace │──── Sampling and dynamic redaction; high sensitivity: hash or omit
  └─────────────┘
        │
        ▼
  ┌─────────────┐    Retention, purpose, vendor DPA; eval sets
  │Audit & comp.│──── Second scan before UI echo; cross-border and data-subject flows
  └─────────────┘

Principle: if a field does not need to enter the model or index, keep it out; when it must, use pseudonyms or masks while keeping retrieval and troubleshooting workable (e.g. stable hashes to correlate a session).
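As a minimal sketch of the stable-hash idea, the helper below derives a pseudonym with a keyed hash (HMAC-SHA-256) from Node's `crypto` module. The function name `pseudonymize`, the `u_` prefix, and the 16-character truncation are assumptions made for illustration; the secret key is assumed to live outside logs and indexes.

```typescript
import { createHmac } from 'node:crypto';

// Derives a stable pseudonym: the same value and key always give the same
// output, so events can be correlated without storing the raw value.
function pseudonymize(value: string, secretKey: string, prefix = 'u'): string {
  const digest = createHmac('sha256', secretKey).update(value).digest('hex');
  // Truncated for readability in logs; a shorter digest trades collision resistance for brevity.
  return `${prefix}_${digest.slice(0, 16)}`;
}
```

A keyed hash is used here instead of a plain hash because low-entropy values such as phone numbers can otherwise be recovered by hashing every candidate.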

Common PII regex rules and redaction implementation

// pii-redactor.ts — common PII regex rule library
interface PiiRule {
  name: string;
  pattern: RegExp;
  mask: (match: string) => string;
}

export const PII_RULES: PiiRule[] = [
  {
    name: 'mobile',
    pattern: /(?<!\d)(1[3-9]\d{9})(?!\d)/g,
    mask: (m) => m.slice(0, 3) + '****' + m.slice(7),   // 138****5678
  },
  {
    name: 'email',
    pattern: /\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b/g,
    mask: (m) => {
      const at = m.indexOf('@');
      return m.charAt(0) + '***' + m.slice(at);  // z***@example.com
    },
  },
  {
    name: 'ipv4',
    pattern: /\b(\d{1,3}\.){3}(\d{1,3})\b/g,
    mask: (m) => m.replace(/\.\d+$/, '.***'),  // 192.168.1.***
  },
];

export function redactText(text: string, rules = PII_RULES): string {
  let result = text;
  for (const rule of rules) {
    rule.pattern.lastIndex = 0;
    result = result.replace(rule.pattern, rule.mask);
  }
  return result;
}

// Allowlist: order numbers and internal IDs should not be misidentified as PII
const ALLOWLIST_PATTERNS = [
  /\border-[A-Z0-9]{8,}\b/g,      // order number
  /\breq-[a-f0-9]{8,}\b/g,        // request ID
];

export function redactWithAllowlist(text: string): string {
  // First replace allowlist items with placeholders
  const placeholders: string[] = [];
  let masked = text;
  ALLOWLIST_PATTERNS.forEach((re, i) => {
    masked = masked.replace(re, (m) => {
      placeholders.push(m);
      return `__WL${i}_${placeholders.length - 1}__`;
    });
  });
  // Apply PII redaction
  masked = redactText(masked);
  // Restore allowlist items
  masked = masked.replace(/__WL\d+_(\d+)__/g, (_, idx) => placeholders[Number(idx)]);
  return masked;
}

Logging redaction middleware (auto-redact request/response logs)

// pii-log-middleware.ts — Express request/response log auto-redaction
import type { Request, Response, NextFunction } from 'express';
import { redactText } from './pii-redactor';

// Keys whose values are always replaced, whatever they contain
// (case-insensitive substring match on the key name)
const PII_LOG_FIELDS = ['email', 'phone', 'mobile', 'id_card', 'password', 'token'];
const MAX_DEPTH = 5;

function redactObject(obj: unknown, depth = 0): unknown {
  if (obj === null || obj === undefined) return obj;
  // Fail closed: never log the raw remainder of a structure nested deeper than MAX_DEPTH
  if (depth > MAX_DEPTH) return '[TRUNCATED]';
  if (typeof obj === 'string') return redactText(obj);
  if (Array.isArray(obj)) return obj.map(v => redactObject(v, depth + 1));
  if (typeof obj === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      if (PII_LOG_FIELDS.some(f => key.toLowerCase().includes(f))) {
        result[key] = '[REDACTED]';  // field name match, redact entirely
      } else {
        result[key] = redactObject(value, depth + 1);
      }
    }
    return result;
  }
  return obj;
}

export function piiLogMiddleware(req: Request, res: Response, next: NextFunction) {
  // Redact request log
  const safeBody = redactObject(req.body);
  const safeQuery = redactObject(req.query);

  console.log(JSON.stringify({
    level: 'info',
    type: 'request',
    method: req.method,
    path: req.path,
    query: safeQuery,
    body: safeBody,
    request_id: req.headers['x-request-id'],  // use request_id to correlate, not full PII
    timestamp: new Date().toISOString(),
  }));

  // Intercept response log
  const originalJson = res.json.bind(res);
  res.json = (body) => {
    const safeBody = redactObject(body);
    console.log(JSON.stringify({
      level: 'info',
      type: 'response',
      status: res.statusCode,
      body: safeBody,
      request_id: req.headers['x-request-id'],
    }));
    return originalJson(body);
  };

  next();
}

PII detection and replacement before sending to LLM

// llm-pii-guard.ts — auto-detect and replace PII before sending to LLM
import OpenAI from 'openai';

interface RedactionMap {
  [placeholder: string]: string;  // { '[PHONE_1]': '13812345678' }
}

// Reversible redaction: replace with placeholders, save map for restoration
function redactForLLM(text: string): { redacted: string; map: RedactionMap } {
  const map: RedactionMap = {};
  let counter = 0;
  let redacted = text;

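  // Digit lookarounds keep PHONE from matching inside a longer digit run such as
  // an 18-character ID number; ID_CARD also runs first as a second safeguard.
  const rules = [
    { name: 'ID_CARD', pattern: /(?<!\d)\d{17}[\dXx](?!\d)/g },
    { name: 'EMAIL', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
    { name: 'PHONE', pattern: /(?<!\d)1[3-9]\d{9}(?!\d)/g },
  ];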

  for (const rule of rules) {
    rule.pattern.lastIndex = 0;
    redacted = redacted.replace(rule.pattern, (match) => {
      const placeholder = `[${rule.name}_${++counter}]`;
      map[placeholder] = match;
      return placeholder;
    });
  }

  return { redacted, map };
}

// Usage example
const client = new OpenAI();

async function safeChat(userMessage: string): Promise<string> {
  // Redact before sending to LLM
  const { redacted, map } = redactForLLM(userMessage);

  const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a customer service assistant.' },
      { role: 'user', content: redacted },  // send redacted content
    ],
  });

  const reply = response.choices[0].message.content ?? '';

  // Optionally restore original values in the reply (business decision)
  // return restoreFromLLM(reply, map);
  return reply;
}

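// Restore placeholders in LLM output (when original values must be shown)
function restoreFromLLM(text: string, map: RedactionMap): string {
  return Object.entries(map).reduce(
    // split/join replaces every occurrence and inserts `original` literally;
    // String.replace with a string pattern swaps only the first occurrence
    // and interprets `$` sequences in the replacement.
    (t, [placeholder, original]) => t.split(placeholder).join(original),
    text
  );
}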

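// RAG: redact document chunks before vectorizing.
// vectorStore and secureStorage stand in for application-specific clients;
// they are declared here only so the example type-checks.
declare const vectorStore: { upsert(doc: { text: string }): Promise<void> };
declare const secureStorage: { store(doc: { original: string }): Promise<void> };

async function indexDocument(content: string) {
  const { redacted } = redactForLLM(content);
  // Store redacted version in vector store; original in controlled storage
  await vectorStore.upsert({ text: redacted });
  await secureStorage.store({ original: content });
}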

  • Agent tools: tool parameters and return values go through the same redaction configuration.
  • Evaluation: adversarial samples and multilingual PII coverage; record false positive and missed detection rates.
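The two rates named in the evaluation bullet can be computed from a labeled sample set. The sketch below assumes a boolean detector and a hand-labeled `LabeledSample` list; both types and the `evaluate` helper are hypothetical, and a real evaluation would likely score at span level instead of per text.

```typescript
// Hypothetical evaluation harness; types and names are illustrative.
interface LabeledSample {
  text: string;
  containsPii: boolean; // ground-truth label assigned by a reviewer
}

interface EvalReport {
  falsePositiveRate: number; // clean samples flagged as PII / all clean samples
  missRate: number;          // PII samples not flagged / all PII samples
}

function evaluate(detector: (text: string) => boolean, samples: LabeledSample[]): EvalReport {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const s of samples) {
    const flagged = detector(s.text);
    if (s.containsPii && flagged) tp++;
    else if (s.containsPii && !flagged) fn++;
    else if (!s.containsPii && flagged) fp++;
    else tn++;
  }
  return {
    // Guard against empty classes so the report never contains NaN.
    falsePositiveRate: fp + tn === 0 ? 0 : fp / (fp + tn),
    missRate: fn + tp === 0 ? 0 : fn / (fn + tp),
  };
}
```

Including adversarial samples (for example digits spelled out as words) in the labeled set is what makes the miss rate meaningful.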

Masking vs encryption vs tokenization — selection criteria

| Approach | Example | Reversible | When to use | When not to use |
|---|---|---|---|---|
| Masking | 138****5678 | No | Logs, external reports, customer service UI where only partial visibility is needed | Business flows that need to recover the original value (e.g. sending an SMS) |
| Encryption | AES-256-GCM | Yes (with key) | The original value must be decrypted and used by business logic (e.g. sending notifications) | Field-level search is required (encrypted values cannot be queried directly) |
| Tokenization | tok_abc123 | Yes (with vault) | Payment card PAN storage (PCI DSS); cross-system reference to the same user | Range queries or analytics against the original value are needed |
| Hashing | SHA-256 + salt | No | Deduplication checks (existence), session correlation (stable hash) | The original value must be displayed; low-entropy inputs such as phone numbers, which can be recovered by hashing every candidate unless a secret key is used |
  • Cross-border: transfer mechanisms and regional deployment options remain placeholders in SKILL output until legal confirms them.
  • Do not treat “redaction applied” as a legal conclusion in automated output; significant policy decisions require human review.
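For the encryption row, a field-level AES-256-GCM round trip might look like the sketch below, using Node's built-in `crypto` module. The helper names `encryptField` and `decryptField` and the dot-separated `iv.tag.ciphertext` payload format are assumptions for illustration; key storage and rotation are out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Encrypts one field value; `key` must be 32 bytes for AES-256.
function encryptField(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh 96-bit nonce per value; never reuse with the same key
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // authentication tag, checked on decrypt
  // Base64 never contains '.', so it is safe as a separator.
  return [iv, tag, ciphertext].map((b) => b.toString('base64')).join('.');
}

// Reverses encryptField; throws if the payload or tag was modified.
function decryptField(payload: string, key: Buffer): string {
  const [iv, tag, ciphertext] = payload.split('.').map((p) => Buffer.from(p, 'base64'));
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

Because each call uses a random nonce, encrypting the same value twice yields different payloads, which is why encrypted columns cannot be searched by equality without an additional index such as a keyed hash.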

---
name: pii-redaction
description: PII identification, redaction and masking, log middleware, LLM pipeline PII protection, comparison of handling strategies
---
# Steps
1. Identify PII types: direct (email, phone, ID card), quasi-identifiers (birthdate + zip), sensitive (health, financial)
2. Build regex rule library: covering email, mobile, IP address, ID card number
3. Allowlist mechanism: order numbers and internal IDs to prevent false positives
4. Logging middleware: auto-redact PII from request/response logs based on field names
5. LLM/RAG pipeline: replace PII with reversible placeholders before sending; original in controlled storage
6. Vector database: store redacted versions only; never send raw PII to embedding API
7. Agent tool call parameters: apply same redaction pipeline before logging tool args
8. Data minimization: fields that don't need to enter models/indices should not enter
9. Redaction vs encryption vs tokenization vs hashing: choose based on reversibility needs
10. Evaluate coverage: adversarial test samples + multilingual PII (names in different scripts)
11. Allowlist governance: version-controlled, reviewed; entries with rationale and expiry
12. Cross-border transfer: redact or pseudonymize before transferring across jurisdiction boundaries

# Anti-patterns
- Do NOT pass raw PII directly to LLM prompts or vector databases
- Do NOT rely only on regex without context-aware checks (high false-positive rate)
- Do NOT use redaction as a substitute for access control
