Detection & Masking Rules

Detection is what decides what gets masked. SCRUBR combines several detectors into a single pass and resolves overlaps deterministically by priority.

Detectors

Detector	What it catches	Config
Glossary	exact literal terms (codenames, internal hostnames)	`glossary[]`
Rules	regex patterns (token/key formats, PII)	`rules[]`
Entropy	high-entropy strings no named rule covers	`entropy`
NER	person-name PII (heuristic)	`ner`
Secret sources	live values from `.env` / files / Vault	`sources[]`

Glossary terms and secret-source values feed one Aho-Corasick automaton; rules compile into one regex-automata meta-engine. Cost stays roughly flat as you add rules.

flowchart LR
    IN["Request content<br/>scan_paths"] --> G["Glossary<br/>Aho-Corasick"]
    IN --> RX["Rules<br/>regex meta-engine"]
    IN --> EN["Entropy"]
    IN --> NER["NER"]
    SRC[(".env · file · Vault")] -. feeds .-> G
    G --> OV{"Overlap<br/>by priority"}
    RX --> OV
    EN --> OV
    NER --> OV
    OV --> SP["Spans → mask ⟦S:TYPE·id·tag⟧"]

Writing a rule

rules:
  - name: aws_key
    type: AWS_KEY          # shown in the sentinel: ⟦S:AWS_KEY·id⟧
    pattern: '\bAKIA[0-9A-Z]{16}\b'
    priority: 95           # higher wins when spans overlap

Write patterns as YAML single-quoted scalars so backslashes are literal. type is the label that appears in the sentinel and in audit counts. priority breaks overlaps — e.g. an ANTHROPIC_KEY rule at 96 beats a generic OPENAI_KEY rule at 95 on sk-ant-….

Use the curated ruleset

Rather than start from scratch, copy examples/common-rules.yaml: ready-to-use patterns for AWS/GCP/DigitalOcean keys, GitHub/GitLab/Slack/Stripe/SendGrid/ Twilio/npm/OpenAI/Anthropic tokens, JWTs, PEM private keys, credential URLs, bearer tokens, generic key = value assignments, and email — plus a high-entropy catcher.

Scan paths: mask content, not metadata

A rule only runs on the JSON paths a route's profile names. This is how you avoid masking fields that would break the API:

profiles:
  openai:
    scan_paths:   ["messages[].content"]          # mask these
    stream_paths: ["choices[].delta.content"]      # rehydrate these (SSE)

[] descends into every array element; only string leaves are touched. model, temperature, tool schemas, etc. are left exactly as the client sent them.

Tuning

Favor recall. A missed secret is a leak; a false positive just masks a benign string that the model still reasons about via a stable placeholder.
Validate in dry-run. SCRUBR reports detections (header + audit log) without altering the request, so you can measure coverage and false-positive rate before enforcing.
Layer entropy last. Give it a low priority so named rules win and label things precisely; entropy is the safety net for unknown formats.
Pull live secrets in. Point a Vault / .env source at your real credentials so they're masked even without a matching pattern.

See the full configuration reference for every detector option.