SCRUBR — Design
Sensitive Content Replacement, Unmasking, Brokering & Rehydration
A single-binary forward proxy that masks secrets / PII / sensitive data on outbound requests and rehydrates (unmasks) it on the inbound response — including mid-stream. It sits between your applications and external LLM providers so the provider only ever sees opaque placeholders, while your users still receive fully reconstituted responses.
Positioning: an LLM gateway optimizes routing, cost, and caching and treats the payload as opaque. SCRUBR does the opposite — it owns the payload and guarantees a lossless, reversible de-identification round trip. The wedge is security & compliance (SOC 2, PCI-DSS, HIPAA, GDPR), not routing.
1. Goals & non-goals
Goals
- Lossless reversibility. `mask → provider → rehydrate` reconstructs the original exactly. A leaked placeholder in user output is treated as a correctness failure.
- Streaming-first. Full SSE/chunked support with minimal added time-to-first-token.
- Speed. Sub-millisecond, near-zero-allocation scan on the hot path; overhead dominated by the upstream provider, not by SCRUBR.
- Config-driven first. Glossary + regex + entropy rules from config files, hot-reloaded.
- Provider-agnostic. Upstreams (endpoint URLs) are defined in config, not hardcoded. Any OpenAI/Anthropic-compatible or arbitrary HTTP(S) model API works by adding a route.
- Single binary, multi-arch. No mandatory external dependency for single-node operation.
- Provable compliance story. The provider demonstrably saw only opaque ids; audit log records what categories were masked, never the values.
Non-goals (initially)
- TLS interception / MITM. v0 is an explicit endpoint (clients point at SCRUBR and SCRUBR re-originates upstream). MITM is a later option, not the default.
- ML/NER PII detection inline. Deferred to a later phase behind the same span interface.
- Cost-based routing / load-balancing across providers (that's an LLM-gateway concern). Note: SCRUBR still routes by upstream — each inbound route maps to a configured upstream URL — it just doesn't make cost/latency routing decisions.
- Media (image / audio / document) content scanning. Text-only in v0, but the pipeline is designed so a media `Detector` slots in later behind the same span interface (see §6, §8).
Primary users
- Enterprises placing a compliance guardrail in front of external LLM APIs.
- Developers who want the same protection locally (later fed by `.env` / secret-file scans).
2. Reversibility model (the core decision)
Masking must be reversible, which rules out format-preserving fake values as the default. A fake-but-valid replacement (e.g. swapping one email for another) looks natural to the model, but on the return path it is indistinguishable from ordinary text. If the model reformats, translates, or "corrects" it even slightly, rehydration silently fails and a placeholder leaks. Unacceptable under a reversibility guarantee.
Sentinel grammar
Default to a deterministic sentinel the model treats as an opaque identifier and tends to pass through unchanged:
⟦S·7f3a·9x2⟧ bare (generic)
⟦S:EMAIL·7f3a·9x2⟧ typed (type hint keeps model output coherent)
Grammar (fixed, self-delimiting):
sentinel := PREFIX [ ":" TYPE ] "·" ID "·" TAG SUFFIX
PREFIX := "⟦S"
SUFFIX := "⟧"
TYPE := [A-Z]+ // EMAIL, SECRET, CODENAME, ...
ID := base62(u32) // index into the reverse table
TAG := base62(u32) // keyed MAC of ID (authenticates the sentinel)
Design properties:
- Rare, self-delimiting prefix → the return-path scan is a single `memchr`-style substring search for the multi-byte `⟦S` sequence, followed by a bounded parse. No per-request automaton to rebuild.
- The id is an index, not the data. 7f3a indexes a request-local reverse table.
The secret never leaves SCRUBR's memory. This is the compliance guarantee.
- Authenticated. The tag is a per-vault keyed MAC (truncated HMAC-SHA256) of the id.
Rehydration resolves an id only if its tag matches, so a hostile/compromised upstream
cannot forge or blindly enumerate sentinels (⟦S·0⟧, ⟦S·1⟧, …) to read the vault.
- Deterministic per scope. A forward map original → id collapses repeated
occurrences to the same id, so the model sees consistency and we dedupe for free.
- Typed variant is the quality knob: bare sentinels can make models hallucinate
around "missing" context; a type hint mitigates it while staying trivially reversible.
Configurable per rule.
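The grammar above can be sketched as a small encoder and parser. This is a minimal illustration, not SCRUBR's actual implementation: the base62 alphabet order (`0-9`, `A-Z`, `a-z`), the `Sentinel` struct, and the function names are all assumptions. Computing and verifying the tag (the truncated HMAC-SHA256) is deliberately left to the caller; here the tag is just a number carried through the grammar.

```rust
/// A parsed sentinel: optional type hint, numeric id, numeric tag.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct Sentinel {
    pub kind: Option<String>,
    pub id: u32,
    pub tag: u32,
}

// Assumed alphabet order; the design only says "base62".
const ALPHABET: &[u8; 62] = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const PREFIX: &str = "⟦S";
const SUFFIX: &str = "⟧";
const SEP: char = '·';

/// Encode a u32 as base62, most significant digit first.
pub fn base62_encode(mut n: u32) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut digits = Vec::new();
    while n > 0 {
        digits.push(ALPHABET[(n % 62) as usize]);
        n /= 62;
    }
    digits.reverse();
    String::from_utf8(digits).expect("alphabet is ASCII")
}

/// Decode base62; rejects empty input, foreign characters, and u32 overflow.
pub fn base62_decode(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    let mut n: u32 = 0;
    for b in s.bytes() {
        let d = ALPHABET.iter().position(|&c| c == b)? as u32;
        n = n.checked_mul(62)?.checked_add(d)?;
    }
    Some(n)
}

/// Render `PREFIX [":" TYPE] "·" ID "·" TAG SUFFIX`.
pub fn format_sentinel(kind: Option<&str>, id: u32, tag: u32) -> String {
    let mut out = String::from(PREFIX);
    if let Some(k) = kind {
        out.push(':');
        out.push_str(k);
    }
    out.push(SEP);
    out.push_str(&base62_encode(id));
    out.push(SEP);
    out.push_str(&base62_encode(tag));
    out.push_str(SUFFIX);
    out
}

/// Parse one complete sentinel; anything off-grammar yields `None`.
pub fn parse_sentinel(s: &str) -> Option<Sentinel> {
    let body = s.strip_prefix(PREFIX)?.strip_suffix(SUFFIX)?;
    let mut parts = body.split(SEP);
    let head = parts.next()?; // "" (bare) or ":TYPE" (typed)
    let id = base62_decode(parts.next()?)?;
    let tag = base62_decode(parts.next()?)?;
    if parts.next().is_some() {
        return None; // trailing fields are not part of the grammar
    }
    let kind = match head.strip_prefix(':') {
        Some(t) if !t.is_empty() && t.bytes().all(|b| b.is_ascii_uppercase()) => {
            Some(t.to_string())
        }
        Some(_) => return None,
        None if head.is_empty() => None,
        None => return None,
    };
    Some(Sentinel { kind, id, tag })
}
```

A caller would parse first, recompute the MAC over the id with the vault key, and consult the reverse table only if the recomputed value matches `tag`.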
Format-preserving masking remains a non-default, opt-in style for cases where a rule is one-way (redaction, never rehydrated) or where model quality strictly requires it and the value is known to pass through verbatim.
3. Round trip
EGRESS (request)
  scan body bytes (single pass)
    └─ detect spans ─ glossary (Aho-Corasick) ‖ regex (RegexSet) ‖ entropy
  for each span (priority, then longest-match-wins):
    id = forward_map.get_or_insert(hash(span))
    reverse[id] = span          // request- or session-scoped
    emit ⟦S:TYPE·id·tag⟧ in place of span
  forward scrubbed body upstream (vectored write of [slice|sentinel|slice|...])
INGRESS (response, possibly SSE stream)
  per chunk:
    memchr('⟦S') → parse id + tag → verify tag → reverse[id] → splice original back in
    hold back only a bounded tail that might be a partial sentinel (max_sentinel_len)
    flush everything before the tail immediately (protect TTFT)
  on stream end:
    zeroize reverse map + forward map (secure wipe)
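The egress half of the round trip can be sketched as follows. This is an illustration under stated assumptions: `Span`, `Mapping`, `resolve`, and `mask` are hypothetical names; spans are assumed to arrive from the detectors as byte offsets on UTF-8 character boundaries; and the emitted sentinel uses a decimal id and omits the tag for brevity.

```rust
use std::collections::HashMap;

/// A detected span in the body (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
    pub kind: &'static str,
    pub priority: u32,
}

/// Pick a non-overlapping subset deterministically:
/// higher priority first, then longer match, then earlier start.
pub fn resolve(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then((b.end - b.start).cmp(&(a.end - a.start)))
            .then(a.start.cmp(&b.start))
    });
    let mut kept: Vec<Span> = Vec::new();
    for s in spans {
        // Keep a span only if it overlaps nothing already accepted.
        if kept.iter().all(|k| s.end <= k.start || s.start >= k.end) {
            kept.push(s);
        }
    }
    kept.sort_by_key(|s| s.start);
    kept
}

/// Forward map (original -> id) plus reverse table (id -> original).
#[derive(Default)]
pub struct Mapping {
    forward: HashMap<String, u32>,
    reverse: Vec<String>,
}

impl Mapping {
    /// Repeated originals collapse to the same id.
    pub fn intern(&mut self, original: &str) -> u32 {
        if let Some(&id) = self.forward.get(original) {
            return id;
        }
        let id = self.reverse.len() as u32;
        self.forward.insert(original.to_string(), id);
        self.reverse.push(original.to_string());
        id
    }

    pub fn lookup(&self, id: u32) -> Option<&str> {
        self.reverse.get(id as usize).map(String::as_str)
    }
}

/// Replace every winning span with a typed sentinel (tag omitted in this sketch).
pub fn mask(body: &str, spans: Vec<Span>, map: &mut Mapping) -> String {
    let mut out = String::with_capacity(body.len());
    let mut cursor = 0;
    for s in resolve(spans) {
        out.push_str(&body[cursor..s.start]);
        let id = map.intern(&body[s.start..s.end]);
        out.push_str(&format!("⟦S:{}·{}⟧", s.kind, id));
        cursor = s.end;
    }
    out.push_str(&body[cursor..]);
    out
}
```

Because `resolve` uses a total ordering, the same input always masks the same way, which is the determinism property the round trip depends on.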
Correctness traps that must be handled in v0
- Partial sentinel across chunk boundaries. Buffer back at most `max_sentinel_len` bytes; everything before it is safe to flush.
- Overlapping / competing matches on egress. Deterministic resolution: by priority, then longest-match-wins. Without this, the same input can mask differently run-to-run, breaking determinism.
- Model-invented lookalikes. An id not present in the reverse table is left verbatim (never error, never guess). Rare prefix makes collisions negligible.
- Cross-node concurrency. Each node allocates ids from a disjoint `IdSpace` (`node_id` high bits + counter), and the Redis backend stores each entry as its own hash field, so two nodes interning concurrently in one session never collide ids and never clobber each other's entries (no lost rehydration). Dedup stays correct because the forward map is rebuilt from all nodes' entries on load.
- Mapping lifetime. Request scope wipes at response end; session scope holds until TTL. Both use `zeroize` on drop.
- Sentinel fragmented across SSE delta events. Real LLM streaming tokenizes a sentinel into several `data:` events with JSON/SSE framing between the pieces, so raw-byte scanning sees `⟦S:EM"}}…data:{…"AIL·0⟧` and can't reassemble it. The response path detects `text/event-stream`, buffers whole events, and runs each event's `stream_paths` content (e.g. `choices[].delta.content`) through a persistent rehydrator whose carry buffer spans events; re-serialization re-escapes the spliced original. (Non-streaming JSON keeps the simpler raw-byte path, where the full sentinel is contiguous in one body.)
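The bounded-tail buffering described in the first and third traps can be sketched as a small state machine over raw bytes. This is a hedged illustration, not the shipped rehydrator: the `Rehydrator` type, its `resolve` callback (sentinel text in, original out, `None` for unknown or unauthenticated sentinels), and the `max_len` parameter are assumptions. It covers only already-extracted content; the SSE/JSON event framing from the last trap is out of scope here.

```rust
const PREFIX: &[u8] = "⟦S".as_bytes();
const SUFFIX: &[u8] = "⟧".as_bytes();

/// Naive byte-substring search (a real build would use a SIMD searcher).
fn find(hay: &[u8], needle: &[u8]) -> Option<usize> {
    hay.windows(needle.len()).position(|w| w == needle)
}

/// Length of the longest proper prefix of PREFIX that `buf` ends with.
fn partial_prefix_len(buf: &[u8]) -> usize {
    (1..PREFIX.len())
        .rev()
        .find(|&k| buf.ends_with(&PREFIX[..k]))
        .unwrap_or(0)
}

pub struct Rehydrator<F> {
    carry: Vec<u8>, // held-back tail, always shorter than max_len
    max_len: usize, // longest sentinel we are willing to wait for, in bytes
    resolve: F,
}

impl<F: Fn(&str) -> Option<String>> Rehydrator<F> {
    pub fn new(max_len: usize, resolve: F) -> Self {
        Self { carry: Vec::new(), max_len, resolve }
    }

    /// Feed one chunk; returns everything that is safe to flush now.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<u8> {
        let mut buf = std::mem::take(&mut self.carry);
        buf.extend_from_slice(chunk);
        let mut out = Vec::with_capacity(buf.len());
        let mut pos = 0;
        loop {
            let rest = &buf[pos..];
            let Some(i) = find(rest, PREFIX) else {
                // No prefix: flush all but a possible split prefix at the end.
                let keep = partial_prefix_len(rest);
                out.extend_from_slice(&rest[..rest.len() - keep]);
                self.carry.extend_from_slice(&rest[rest.len() - keep..]);
                break;
            };
            out.extend_from_slice(&rest[..i]);
            let cand = &rest[i..];
            let window = &cand[..cand.len().min(self.max_len)];
            let Some(j) = find(window, SUFFIX) else {
                if cand.len() < self.max_len {
                    // Possibly a partial sentinel: hold it for the next chunk.
                    self.carry.extend_from_slice(cand);
                    break;
                }
                // Too long to be a sentinel: pass the prefix through verbatim.
                out.extend_from_slice(PREFIX);
                pos += i + PREFIX.len();
                continue;
            };
            let end = j + SUFFIX.len();
            let resolved = std::str::from_utf8(&cand[..end])
                .ok()
                .and_then(|t| (self.resolve)(t));
            match resolved {
                Some(original) => {
                    out.extend_from_slice(original.as_bytes());
                    pos += i + end;
                }
                None => {
                    // Unknown id or lookalike: leave verbatim and keep scanning.
                    out.extend_from_slice(PREFIX);
                    pos += i + PREFIX.len();
                }
            }
        }
        out
    }

    /// End of stream: whatever is still held is emitted unchanged.
    pub fn finish(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.carry)
    }
}
```

The held-back tail never reaches `max_len` bytes, so added latency stays bounded no matter how the upstream chunks its output, including splits inside a multi-byte character.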
4. Performance strategy
Ordered by impact.
Detection (egress hot path)
- One pass, two automata. ✅ All glossary literals → a single `aho-corasick` automaton (SIMD Teddy / memchr). All regex rules → one `regex-automata` meta-engine via `new_many`, matched in a single pass (patterns ordered by priority for same-offset arbitration). Detection cost is ~flat in rule count: 2 rules ≈ 25 rules in `cargo bench`.
- Compile config → immutable matcher artifacts once. Never compile per request. Hot-reload builds new artifacts off-thread, swaps via `arc-swap` (lock-free reads).
- Provider-aware scanning. Parse only the sensitive JSON paths (`messages[].content`, tool-call arguments): less work and fewer false positives.
Allocation & copying
- Zero-copy by default. Body as `Bytes`; output is a list of slices `[orig 0..120][sentinel][orig 145..600]…` written vectored. Only matched spans allocate.
- Per-request bump arena (`bumpalo`) for span list + maps; dropped wholesale at request end. Pool and reuse scan buffers.
- Byte-level work, skipping UTF-8 revalidation where boundary safety is guaranteed.
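The slice-list output in the first bullet can be illustrated with the standard library alone. This is a sketch under assumptions: `Piece`, `plan`, and `write_pieces` are hypothetical names, spans are assumed sorted and non-overlapping, and plain `&[u8]` stands in for the `Bytes` type. Only the sentinels are owned; the untouched parts of the body are borrowed.

```rust
use std::io::{self, IoSlice, Write};
use std::ops::Range;

/// One output segment: a borrowed slice of the body or an owned sentinel.
pub enum Piece<'a> {
    Original(&'a [u8]),
    Sentinel(String),
}

/// Interleave untouched body slices with sentinels.
/// `spans` must be sorted by start and non-overlapping.
pub fn plan<'a>(body: &'a [u8], spans: Vec<(Range<usize>, String)>) -> Vec<Piece<'a>> {
    let mut pieces = Vec::with_capacity(spans.len() * 2 + 1);
    let mut cursor = 0;
    for (range, sentinel) in spans {
        if range.start > cursor {
            pieces.push(Piece::Original(&body[cursor..range.start]));
        }
        pieces.push(Piece::Sentinel(sentinel));
        cursor = range.end;
    }
    if cursor < body.len() {
        pieces.push(Piece::Original(&body[cursor..]));
    }
    pieces
}

/// Write all pieces with one vectored call where the writer allows it.
pub fn write_pieces<W: Write>(w: &mut W, pieces: &[Piece<'_>]) -> io::Result<()> {
    let bufs: Vec<IoSlice<'_>> = pieces
        .iter()
        .map(|p| match p {
            Piece::Original(s) => IoSlice::new(s),
            Piece::Sentinel(s) => IoSlice::new(s.as_bytes()),
        })
        .collect();
    let total: usize = bufs.iter().map(|b| b.len()).sum();
    let mut written = w.write_vectored(&bufs)?;
    if written < total {
        // Short write: finish the remainder buffer by buffer.
        for b in &bufs {
            if written >= b.len() {
                written -= b.len();
                continue;
            }
            w.write_all(&b[written..])?;
            written = 0;
        }
    }
    Ok(())
}
```

`write_vectored` may accept fewer bytes than offered, so the fallback loop is what makes the function correct for sockets as well as in-memory buffers.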
Concurrency & state
- Request-local reverse map → zero shared-lock contention on the default path.
- Lock-free config (`arc-swap`); sharded structures only where genuinely shared.
- Redis is opt-in, only for sessions that span nodes; never on the critical path.
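The snapshot-and-swap pattern behind hot reload can be shown with the standard library. Note the substitution: this sketch uses `RwLock<Arc<T>>` as a stand-in for `arc-swap`, so reads here take a brief read lock rather than being lock-free. `Matchers` and `ConfigHandle` are hypothetical names.

```rust
use std::sync::{Arc, RwLock};

/// Immutable, fully compiled matcher artifacts (placeholder fields).
pub struct Matchers {
    pub generation: u64,
    pub literals: Vec<String>,
}

/// Holds the current artifacts; readers take a cheap snapshot.
pub struct ConfigHandle {
    current: RwLock<Arc<Matchers>>,
}

impl ConfigHandle {
    pub fn new(initial: Matchers) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot for one request; unaffected by later swaps.
    pub fn load(&self) -> Arc<Matchers> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publish artifacts that were built off-thread.
    pub fn swap(&self, next: Matchers) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

A request that loaded a snapshot before a reload keeps scanning with the old rules until it finishes, so no request ever sees a half-applied config.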
I/O & transport
- Build on `pingora` (or `hyper` + `tower`) for upstream connection pooling, keep-alive, and H2 multiplexing. Reusing the TLS session to the provider saves more wall-clock than any scan optimization.
- Streaming-first everywhere, bounded buffers, backpressure-aware vectored writes.
Measurement
- p99 added-latency is a first-class metric from day one. `criterion` benches over a representative prompt corpus; every optimization proven, not assumed.
5. Configuration (v0)
Compiled once into immutable matchers, hot-reloaded on change.
# scrubr.yaml
# Upstreams are config, not code: point a route at any model API.
routes:
  - listen_path: "/openai"            # what clients hit on SCRUBR
    upstream: "https://api.openai.com"
    profile: openai                   # which scan profile to apply
  - listen_path: "/anthropic"
    upstream: "https://api.anthropic.com"
    profile: anthropic
  - listen_path: "/internal-llm"      # self-hosted / arbitrary provider
    upstream: "http://llm.internal:8000"
    profile: openai                   # OpenAI-compatible schema → reuse profile
profiles:
  openai:
    scan_paths:
      - "messages[].content"
      - "messages[].tool_calls[].function.arguments"
  anthropic:
    scan_paths:
      - "messages[].content"
      - "system"
masking:
  style: typed-sentinel   # typed-sentinel | bare-sentinel | format-preserving (opt-in)
  scope: session          # request | session (determinism boundary)
  ttl: 30m
glossary:                 # literal terms → Aho-Corasick
  - { term: "Project Hufflepuff", type: CODENAME, priority: 100 }
rules:                    # regex → RegexSet
  - { name: email, type: EMAIL, pattern: '…', priority: 50 }
  - { name: aws_key, type: SECRET, pattern: 'AKIA[0-9A-Z]{16}', priority: 90 }
entropy:                  # optional generic-secret catcher
  enabled: true
  min_bits: 4.0
The glossary is the same interface a secret store will feed later — connectors just become another source that populates the Aho-Corasick automaton at reload time. Nothing in v0 is throwaway.
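One plausible reading of the `entropy.min_bits` setting is a Shannon-entropy threshold in bits per character. The sketch below assumes that interpretation; the function names and the `min_len` guard are hypothetical additions, not documented config keys.

```rust
use std::collections::HashMap;

/// Shannon entropy of `token`, in bits per character.
pub fn shannon_bits_per_char(token: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in token.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Flag a token when it is long enough and random-looking enough.
/// `min_len` is an assumed guard: short strings cannot reach high entropy anyway.
pub fn looks_like_secret(token: &str, min_bits: f64, min_len: usize) -> bool {
    token.chars().count() >= min_len && shannon_bits_per_char(token) >= min_bits
}
```

A token of 16 distinct characters scores exactly 4 bits per character, so `min_bits: 4.0` would catch random keys of that shape while ordinary prose, which repeats letters heavily, stays well below the threshold.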
6. Component architecture
flowchart TD
C(["Client"]) -->|request| L["Listener — HTTP/1.1, H2, SSE"]
L --> D["Detection pipeline · egress<br/>Aho-Corasick ‖ RegexSet ‖ entropy — single pass, merged spans"]
D -->|spans| V["Vaultizer + Mapping<br/>span → ⟦S:TYPE·id·tag⟧ · in-mem, Redis opt-in for cluster"]
V -->|scrubbed body| U[["Upstream provider"]]
U -->|"response · often SSE"| R["Rehydration state-machine · ingress<br/>memchr scan · reverse lookup · bounded-tail buffering"]
R -->|rehydrated| C
Cross-cutting: config + hot reload · policy engine · observability/audit · secure wipe.
Trait boundaries (stable seams for later phases)
- `SecretSource`: pluggable origin of sensitive terms (config file → Vault / AWS / GCP / `.env` / file-scan later). All feed the same matcher build.
- `Detector`: produces spans over a content unit (Aho-Corasick, RegexSet, entropy; ML/NER and media detectors later, such as OCR/vision over images and ASR over audio; all reduce to the same span output so the Vaultizer + mapping path is unchanged).
- `Vaultizer`: span → placeholder + reverse-map entry (sentinel default; FPE opt-in).
- `Upstream` / `Route`: maps an inbound listen path to a configured upstream URL + `ScanProfile`. Provider-agnostic: adding a model API is a config entry, not code.
- `ScanProfile`: per-route description of what to scan (JSON content paths today; media parts such as `image_url`, base64 blobs, attachments later).
- `MappingStore`: request-local (default) / session / Redis-backed.
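To make the `Detector` seam concrete, here is a minimal sketch of how such a trait and one implementation might look. The method signature, the `Span` fields, `LiteralDetector`, and `detect_all` are assumptions for illustration; the real traits may differ.

```rust
/// Common output of every detector (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
    pub kind: String,
    pub priority: u32,
}

/// Anything that can find sensitive spans in a content unit.
pub trait Detector {
    fn detect(&self, content: &str) -> Vec<Span>;
}

/// Simplest possible detector: one literal glossary term.
/// `term` is assumed to be non-empty.
pub struct LiteralDetector {
    pub term: String,
    pub kind: String,
    pub priority: u32,
}

impl Detector for LiteralDetector {
    fn detect(&self, content: &str) -> Vec<Span> {
        content
            .match_indices(self.term.as_str())
            .map(|(i, m)| Span {
                start: i,
                end: i + m.len(),
                kind: self.kind.clone(),
                priority: self.priority,
            })
            .collect()
    }
}

/// Run every detector and pool the spans; overlap resolution happens downstream.
pub fn detect_all(detectors: &[Box<dyn Detector>], content: &str) -> Vec<Span> {
    detectors.iter().flat_map(|d| d.detect(content)).collect()
}
```

Because every detector reduces to the same `Span` output, a later NER or media detector can be added to the `detectors` list without touching masking or rehydration.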
7. Compliance & security posture
- Provider sees only opaque ids — attestable data-minimization property.
- Tamper-evident audit log: per request, the categories and counts detected/masked, never the values. Hash-chained; verified with `scrubr audit-verify`. Writes are synchronous and flushed per record, a deliberate durability-over-latency choice so a crash cannot lose audit records for requests that were served (compliance > throughput).
- Proxy authentication: API keys compared in constant time (no early-return / hash oracle); the proxy's own key is never forwarded upstream.
- At-rest encryption: session vaults in a shared store are sealed with AES-256-GCM so Redis only ever holds ciphertext.
- Policy-as-code per route/tenant/data-class, with a dry-run mode (report what would be masked) for onboarding trust.
- Secure destruction: `zeroize` mappings on drop; bounded TTL for session scope.
- No secret ever logged; metrics are counts/types only.
- Liveness: unauthenticated `/healthz` for load balancers, bypassing auth and routing.
- TLS termination: optional client-facing HTTPS via rustls (`ring` provider; no OpenSSL/aws-lc, so the cross-compiled static binaries are unaffected).
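The constant-time key comparison in the proxy-authentication bullet can be sketched as follows. This shows the general technique only, with a hypothetical function name. An optimizing compiler is free to transform such code, so production builds typically rely on a vetted crate (for example `subtle`) rather than a hand-rolled loop.

```rust
/// Compare a presented key against the expected one without returning early.
/// The loop always runs over the full expected key, so timing does not
/// reveal the position of the first mismatching byte.
pub fn constant_time_eq(expected: &[u8], presented: &[u8]) -> bool {
    // A length mismatch is folded into the accumulator instead of returning.
    let mut diff: u8 = (expected.len() != presented.len()) as u8;
    for (i, &e) in expected.iter().enumerate() {
        // Missing bytes compare against 0; the length flag already marks failure.
        let p = presented.get(i).copied().unwrap_or(0);
        diff |= e ^ p;
    }
    diff == 0
}
```

A plain `==` on byte slices can stop at the first differing byte, which is the early-return oracle the bullet rules out.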
8. Roadmap
Status: ✅ done · 🚧 partial · ⏸️ deferred · ⬜ not started.
1.0 shipped — phases v0–v5 are GA (plus a Helm chart with HA, a multi-arch
container image, and a docs website). The remaining items below are the post-1.0
roadmap.
| Phase | Status | Scope |
|---|---|---|
| v0 | ✅ | Explicit-endpoint proxy, provider-aware scan, glossary + regex + entropy from config, typed-sentinel masking, request + session in-mem map, streaming rehydration, hot reload, criterion benches. |
| v1 | ✅ | Tamper-evident audit log ✅, dry-run mode ✅, per-route policy ✅, proxy auth ✅, per-tenant policy ✅, multi-arch release binaries ✅ (CI + build-release.sh). |
| v2 | 🚧 | .env + secret-file ✅; HashiCorp Vault (KV v2) connector ✅; AWS / GCP secret managers ⏸️ (same SecretSource seam). Curated common-secret ruleset shipped. |
| v3 | ✅ | Redis-backed clustering for cross-node sessions (load-modify-store behind a SessionBackend seam); AES-256-GCM at-rest encryption; node-disjoint ids + per-field hashes for concurrent correctness. |
| v4 | ✅ | NER/PII detection behind the SpanDetector seam — heuristic person-name detector shipped; a model-backed detector plugs in via the same trait (Detector::with_detectors). |
| v5 | ✅ | TLS interception (MITM): per-host certs minted on the fly from a configured CA (CertMinter + rustls ResolvesServerCert), routed by Host. Both SNI-transparent and CONNECT-proxy modes (the latter blind-tunnels un-intercepted hosts). |
| v6 | ⏸️ | Media scanning — OCR/vision over images, ASR over audio — deferred until a stable release; reuses the span → Vaultizer → rehydration path. |
Post-1.0: the AWS/GCP secret-manager connectors (v2) and media scanning (v6) remain designed-for (the `SecretSource` and `Detector` / `Span` seams are in place) and are scheduled now that the 1.0 core has shipped.
9. First thing to prototype
The streaming rehydration state machine (§3 ingress) is the riskiest, demo-defining piece — getting an SSE round trip with mid-stream sentinels rehydrating correctly is the demo that makes the product believable. Build it first, behind a criterion bench.