Configuration Reference

SCRUBR reads a single YAML config (default scrubr.example.yaml; override with --config or SCRUBR_CONFIG). All sections are optional and default to off/empty. The annotated scrubr.example.yaml is a working starting point; examples/common-rules.yaml is a ready-made detection ruleset.

Config is compiled once into immutable matcher artifacts and hot-reloaded atomically when the file (or a watched secret file) changes; a bad edit keeps the last good config.

routes · profiles · masking
rules · glossary · entropy · ner
sources · auth · tenants
sessions · tls · intercept
audit · transactions
CLI & environment

`routes`

Maps inbound requests to an upstream. In normal (path) mode a route matches by listen_path prefix; in interception mode it matches by host.

Key	Type	Default	Notes
`listen_path`	string	—	Inbound path prefix, e.g. `/openai`. Stripped before forwarding.
`host`	string	—	Host matched in interception mode (e.g. `api.openai.com`).
`upstream`	string	—	Base upstream URL, e.g. `https://api.openai.com`.
`profile`	string	—	Name of a `profiles` entry.
`mode`	enum	global	Per-route override of `masking.mode` (`enforce`/`dry-run`).
`scope`	enum	global	Per-route override of `masking.scope` (`request`/`session`).
`style`	enum	global	Per-route override of `masking.style`.

routes:
  - { listen_path: "/openai",  upstream: "https://api.openai.com",    profile: openai }
  - { listen_path: "/canary",  upstream: "https://api.openai.com",    profile: openai, mode: dry-run }
  - { host: "api.openai.com",  upstream: "https://api.openai.com",    profile: openai }   # interception

A route in enforce mode whose profile has no scan_paths logs a warning, because it would otherwise pass requests through unmasked.

`profiles`

Provider-aware content paths. A path is dot-separated; a [] suffix descends into every array element. Only string leaves are masked/rehydrated. The special path "**" scans every string leaf in the body (comprehensive, opt-in — favors recall over the provider-aware minimalism, so validate for false positives).

Key	Type	Notes
`scan_paths`	string[]	Request JSON paths to mask, e.g. `messages[].content`; or `["**"]` for all string leaves.
`stream_paths`	string[]	SSE response content paths to rehydrate per event, e.g. `choices[].delta.content` (OpenAI), `delta.text` (Anthropic). Required for streaming; `["**"]` rehydrates all leaves.

profiles:
  openai:
    scan_paths:   ["messages[].content", "messages[].tool_calls[].function.arguments"]
    stream_paths: ["choices[].delta.content"]
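
As a sketch of the opt-in catch-all path described above, a profile can scan and rehydrate every string leaf. The profile name `catchall` is hypothetical; only the `"**"` path and the two keys come from this reference.

```yaml
profiles:
  catchall:                 # hypothetical profile name
    scan_paths:   ["**"]    # mask every string leaf in the request body
    stream_paths: ["**"]    # rehydrate every string leaf in each SSE event
```

Pointing a `mode: dry-run` route at such a profile first is one way to check for false positives before enforcing.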

`masking`

Global masking policy (each field overridable per route/tenant).

Key	Type	Default	Notes
`mode`	`enforce` | `dry-run`	`enforce`	Dry-run forwards the original and only reports. In enforce mode a JSON-typed body that fails to parse is rejected (422), not forwarded unmasked.
`style`	`typed-sentinel` | `bare-sentinel`	`typed-sentinel`	`⟦S:EMAIL·id·tag⟧` vs `⟦S·id·tag⟧` (the `tag` is a keyed MAC that authenticates the sentinel).
`scope`	`request` | `session`	`request`	Session scope gives stable pseudonyms across a conversation.
`ttl`	duration	`30m`	Session idle timeout (`45s`, `30m`, `1h`, `90`).
`session_header`	string	`x-scrubr-session`	Request header identifying a session.
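
A minimal sketch of a `masking` block, using only the keys from the table above. The values are illustrative choices, not recommendations; comments mark where they differ from the defaults.

```yaml
masking:
  mode: enforce                      # default; dry-run forwards the original and only reports
  style: typed-sentinel              # default
  scope: session                     # stable pseudonyms across a conversation (default: request)
  ttl: 1h                            # session idle timeout (default: 30m)
  session_header: x-scrubr-session   # default header name
```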

`rules`

Regex detection rules (compiled into one meta-engine). Patterns use Rust regex syntax; write them as YAML single-quoted scalars so backslashes are literal.

Key	Type	Notes
`name`	string	Label.
`type`	string	Entity type shown in the sentinel (e.g. `EMAIL`, `AWS_KEY`).
`pattern`	string	Regex.
`priority`	int	Higher wins on overlap.

rules:
  - { name: aws_key, type: AWS_KEY, pattern: '\bAKIA[0-9A-Z]{16}\b', priority: 95 }

See examples/common-rules.yaml for a curated set.
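
To illustrate the quoting advice above, here is a sketch of a second rule. The rule name `email`, its pattern, and its priority are illustrative assumptions and are not taken from examples/common-rules.yaml.

```yaml
rules:
  # Single-quoted: backslashes reach the regex engine unchanged.
  - { name: email, type: EMAIL, pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', priority: 50 }
  # A double-quoted scalar would need every backslash doubled, because YAML
  # reads "\b" inside double quotes as a backspace character:
  #   pattern: "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
```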

`glossary`

Literal terms (Aho-Corasick). Same matcher as secret sources.

glossary:
  - { term: "Project Hufflepuff", type: CODENAME, priority: 100 }

`entropy`

Generic high-entropy secret catcher. Off by default; low priority so named rules win.

Key	Type	Default
`enabled`	bool	`false`
`min_bits`	float	`3.5` (bits/char)
`min_len`	int	`20`
`priority`	int	`10`
`entity_type`	string	`SECRET`
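
A sketch that turns the entropy catcher on while leaving every other key at the default listed above. Only `enabled` differs from the defaults.

```yaml
entropy:
  enabled: true
  min_bits: 3.5         # bits/char (default)
  min_len: 20           # default
  priority: 10          # default; lower than the priorities in the rule and glossary examples
  entity_type: SECRET   # default
```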

`ner`

Heuristic person-name detection (not a trained model; conservative). Off by default.

Key	Type	Default
`enabled`	bool	`false`
`entity_type`	string	`PERSON`
`priority`	int	`30`
`names`	string[]	extra first names beyond the built-in gazetteer
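
A sketch of enabling the heuristic with a couple of extra first names. The names themselves are hypothetical placeholders; the other values are the defaults from the table.

```yaml
ner:
  enabled: true
  entity_type: PERSON            # default
  priority: 30                   # default
  names: ["Saoirse", "Nkechi"]   # hypothetical additions to the built-in gazetteer
```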

`sources`

External secret values pulled at startup/reload and masked (same automaton as the glossary). Each entry has a kind.

dotenv — each KEY=VALUE line contributes VALUE. file — each non-empty, non-comment line is a literal secret.

Key	Default	Notes
`path`	—	File path (relative to the config dir).
`entity_type`	`SECRET`
`priority`	`80`
`min_len`	`5`	Skip values shorter than this.

vault — HashiCorp Vault KV v2. Token resolution: token → token_path file → token_env (default VAULT_TOKEN). Pulled at startup/reload (not polled).

Key	Default	Notes
`address`	—	e.g. `https://vault.internal:8200`.
`mount`	`secret`	KV v2 mount.
`paths`	—	Secret paths under the mount.
`token` / `token_path` / `token_env`	—	Token sources, in that order.
`entity_type`, `priority`, `min_len`	as above

sources:
  - { kind: dotenv, path: ".env" }
  - { kind: file, path: "secrets.txt", min_len: 6 }
  - kind: vault
    address: "https://vault.internal:8200"
    paths: ["app/prod", "shared/api-keys"]
    token_env: "VAULT_TOKEN"

`auth`

API-key gate on the proxy itself (keys are compared in constant time and never forwarded upstream). The gate is enforced automatically whenever tenants are defined.

Key	Default	Notes
`enabled`	`false`
`header`	`x-scrubr-key`	Header carrying the key.
`keys`	`[]`	Accepted keys.

/healthz is always reachable without a key.
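
A minimal sketch of the key gate. The key value is a placeholder, to be replaced with a long random string.

```yaml
auth:
  enabled: true
  header: x-scrubr-key                       # default
  keys: ["replace-with-a-long-random-key"]   # placeholder value
```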

`tenants`

Multi-tenant policy: a client key identifies a tenant with its own policy, private glossary, and isolated session namespace (precedence: tenant > route > global).

Key	Notes
`id`	Tenant id (used in logs/audit + session namespace).
`keys`	Client keys mapping to this tenant.
`mode` / `scope` / `style`	Optional policy overrides.
`glossary`	Tenant-private terms (masked only for this tenant).

tenants:
  - { id: acme, keys: ["acme-key"], scope: session, glossary: [ { term: "Falcon", type: CODENAME, priority: 100 } ] }
  - { id: globex, keys: ["globex-key"], mode: dry-run }
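
The precedence rule (tenant > route > global) can be sketched with one setting at each level. This assumes overrides resolve field by field, as the `masking` section suggests; the effective values in the comments are the expected outcome under that assumption, not captured output.

```yaml
masking:
  mode: enforce      # global
  scope: request     # global
routes:
  - { listen_path: "/openai", upstream: "https://api.openai.com", profile: openai, scope: session }
tenants:
  - { id: globex, keys: ["globex-key"], mode: dry-run }

# Expected effective policy for a globex request to /openai:
#   mode  = dry-run          (tenant override)
#   scope = session          (route override; the tenant sets none)
#   style = typed-sentinel   (global default; nothing overrides it)
```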

`sessions`

Where session reverse-maps live.

Key	Default	Notes
`backend`	`memory`	`memory` (single node) or `redis` (cross-node).
`redis_url`	—	e.g. `rediss://redis.internal:6379/`.
`encryption_key`	—	Passphrase → AES-256-GCM at rest; required for secret-free Redis. Also derives the per-session sentinel MAC key, so set it on multi-node clusters or a session's sentinels won't rehydrate across nodes.
`node_id`	random	`0..4095`, distinct per node; partitions the id space.
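
A sketch of a cross-node setup built from the keys above. The passphrase is a placeholder. Using the same `encryption_key` on every node is an inference from the MAC-key note in the table, not a documented requirement.

```yaml
sessions:
  backend: redis
  redis_url: "rediss://redis.internal:6379/"
  encryption_key: "replace-with-a-long-passphrase"   # placeholder; presumably identical on all nodes
  node_id: 1                                         # 0..4095, distinct per node
```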

`tls`

Terminate client HTTPS at the proxy (else plain HTTP). rustls + ring.

Key	Default
`enabled`	`false`
`cert_path` / `key_path`	— (PEM)
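
A minimal sketch of TLS termination. Both file paths are hypothetical.

```yaml
tls:
  enabled: true
  cert_path: "certs/proxy.pem"       # hypothetical path (PEM)
  key_path:  "certs/proxy-key.pem"   # hypothetical path (PEM)
```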

`intercept`

TLS interception (MITM): mint a per-host cert from a CA and route by Host. Clients must trust the CA. The CA key can mint any cert — protect it like a root key.

Key	Default	Notes
`enabled`	`false`
`connect`	`false`	`false` = SNI-transparent; `true` = CONNECT proxy.
`listen`	`--listen`	Interception endpoint.
`ca_cert_path` / `ca_key_path`	—	PEM CA used to mint leaf certs.
`upstream_ca_path`	—	Extra CA the proxy trusts for upstream connections.
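
A sketch of interception in CONNECT-proxy mode, paired with a host-matched route. The CA file paths are hypothetical; the route mirrors the interception example in the `routes` section.

```yaml
intercept:
  enabled: true
  connect: true                            # CONNECT proxy; false = SNI-transparent
  ca_cert_path: "ca/scrubr-ca.pem"         # hypothetical path
  ca_key_path:  "ca/scrubr-ca-key.pem"     # hypothetical path; protect it like a root key
routes:
  - { host: "api.openai.com", upstream: "https://api.openai.com", profile: openai }
```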

See DEPLOYMENT.md → TLS interception.

`audit`

Tamper-evident, append-only audit log (counts/types only — never values). Verify with scrubr audit-verify <path>.

Key	Default
`enabled`	`false`
`path`	`scrubr-audit.jsonl`
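
A minimal sketch that enables the audit log at its default location:

```yaml
audit:
  enabled: true
  path: scrubr-audit.jsonl   # default
```

The resulting file can then be checked with `scrubr audit-verify scrubr-audit.jsonl`.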

`transactions`

Full request/response log of the masked provider-facing exchange (secret-free in enforce mode). Each response also carries an x-scrubr-request-id header.

Key	Default	Notes
`enabled`	`false`
`path`	`scrubr-transactions.jsonl`
`max_body_bytes`	`65536`	Per-body capture limit (truncated beyond).
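
A sketch that enables the transaction log with a smaller capture limit. The `16384` value is only an illustration.

```yaml
transactions:
  enabled: true
  path: scrubr-transactions.jsonl   # default
  max_body_bytes: 16384             # illustrative; the default is 65536
```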

CLI & environment

scrubr [--config <path>] [--listen <addr>]   # start the proxy
scrubr --version                              # print version
scrubr demo                                   # offline mask → rehydrate round-trip
scrubr audit-verify <path>                    # verify an audit log's hash chain

Env	Purpose
`SCRUBR_CONFIG`	config path (overridden by `--config`)
`SCRUBR_LISTEN`	listen address (overridden by `--listen`)
`RUST_LOG`	log filter, e.g. `scrubr=info`
`VAULT_TOKEN`	default Vault token for `vault` sources
`SCRUBR_SESSION_BACKEND`	override `sessions.backend` (`memory`/`redis`)
`SCRUBR_REDIS_URL`	override `sessions.redis_url`
`SCRUBR_ENCRYPTION_KEY`	override `sessions.encryption_key` (at-rest)
`SCRUBR_NODE_ID`	override `sessions.node_id` (0..4095; e.g. a pod ordinal)

The SCRUBR_* session overrides let an orchestrator inject per-instance cluster settings without templating the config; see DEPLOYMENT.md → Kubernetes.