Security Policy

SCRUBR is a security tool: it sits in the data path and handles plaintext secrets and PII. Treat it as a high-value component and deploy it accordingly.

Reporting a vulnerability

Please report security issues privately — do not open a public issue.

Email: security@scrubr.example (replace with your project contact)
Include: affected version, a description, and reproduction steps or a PoC.

We aim to acknowledge within 3 business days and to provide a remediation timeline after triage. Coordinated disclosure is appreciated; please give us a reasonable window before public disclosure.

Supported versions

The latest 1.x release receives security fixes.

Threat model

What SCRUBR protects

Data minimization to the provider. With masking enforced, the upstream LLM provider receives only opaque sentinels — never the original secrets/PII. This is the core, attestable property.
Reversibility integrity. A masked value round-trips losslessly, including across streamed (SSE) responses, or is left verbatim — it is never silently corrupted or mis-rehydrated.

Trust boundary

SCRUBR necessarily sees plaintext request/response content (it is the masking broker). Run it inside your trust boundary, on hosts and networks you control, with least-privilege access. Anyone who can read SCRUBR's memory, its config, its secret sources, or (for the Redis backend) the session store can see secrets.

Sentinels are authenticated; session keys are still bearer secrets. Every sentinel carries a per-vault keyed MAC tag (⟦S:TYPE·id·tag⟧), so a hostile or compromised upstream cannot forge or blindly enumerate sentinels (⟦S·0⟧, ⟦S·1⟧, …) to read the vault — only sentinels SCRUBR actually issued rehydrate. What remains inherent to reversibility: with scope: session, everyone presenting the same session-header value shares one vault, and an upstream that received a sentinel earlier in the session can replay it — which re-reveals that value to the session owner (not to the upstream). So still use one session per user/trust-unit, make session keys unguessable, and don't mix different users' secrets under one session key. (Request scope confines everything to the caller's own current request.) For cross-node sessions the tag key is derived from sessions.encryption_key, so set it — otherwise nodes can't agree on tags and a session's sentinels won't rehydrate on another node.

Sensitive material and how it is handled

In-memory vaults (request/session mappings) are zeroized on drop; session scope is bounded by TTL.
Redis-backed sessions persist mappings off-process. Enable sessions.encryption_key (AES-256-GCM) so the store holds only ciphertext, and run Redis with AUTH + TLS on a private network. Give every node a distinct sessions.node_id (the Helm chart derives it from the pod ordinal) — colliding ids share an id space and corrupt sessions. Run Redis HA: a transient read failure is surfaced loudly but can still corrupt a session's mappings for that request.
The interception CA key is the most dangerous secret in the system — it can mint a trusted certificate for any host. Protect intercept.ca_key_path with the same rigor as a root signing key (restricted FS permissions, ideally an HSM/KMS in production), and scope the CA's distribution to managed clients only.
Auth keys are compared as fixed-length SHA-256 digests in constant time (revealing neither which key matched nor any key's length) and never forwarded upstream.
Audit log is hash-chained and tamper-evident (scrubr audit-verify), but it is a local file: protect it and consider shipping to append-only/WORM storage. It records detection counts and types only — never values.
Transaction log (optional) captures the masked provider-facing request and response — secret-free in enforce mode. In dry-run mode nothing is masked, so records contain original content; protect the file and avoid dry-run + transactions outside a trusted boundary. Audit and transaction logs are created 0600 (owner-only) on Unix.

Network egress

Upstream redirects are never followed. A 3xx from the upstream is passed through to the client, so a compromised/malicious upstream cannot redirect SCRUBR to an internal service or metadata endpoint (SSRF), nor cause SCRUBR to rehydrate an attacker-chosen target's response with the client's secrets. The Vault connector likewise never follows redirects (its token can't leak to another host).
The CONNECT proxy is not an open relay to internal hosts. Blind tunnels refuse loopback and link-local targets (blocking the cloud metadata endpoint at 169.254.169.254 and localhost pivots), and connect to the exact vetted IP. Still, bind the proxy to trusted networks — it will relay to arbitrary public hosts by design.
Certificate minting is bounded to configured interception hosts, so an attacker cannot force unbounded key-generation with arbitrary SNI values.

Operational guidance

Terminate client TLS at SCRUBR (tls) or run it behind a TLS terminator; the plain-HTTP listener is for trusted local networks only.
Start with dry-run mode to validate detection coverage before enforcing.
Bias detection toward recall for secret/PII categories — a false negative (a leak) is worse than a false positive (a degraded prompt).
Rotate auth keys and the interception CA on a schedule.

Known limitations (as of 1.0)

Masking covers configured JSON content paths. In enforce mode a JSON-typed body that does not parse is rejected (422) rather than forwarded, and a profile can set scan_paths: ["**"] to scan every string leaf. Still, a body sent with a non-JSON content type, or a secret no rule matches, passes through: SCRUBR prevents leakage in well-formed provider requests; it is not a DLP control against a client deliberately exfiltrating over an unscanned channel.
The at-rest encryption_key is derived via SHA-256, not a password-stretching KDF — use a high-entropy key (it is a shared cluster secret, not a password).
The audit hash-chain detects edits and mid-file deletions, but truncation of the most recent records is not self-evident — ship to append-only/WORM storage if that matters.
Auth keys are static (no built-in rotation/expiry).
The heuristic NER is not a trained model; it favors precision and will miss many names. Use it as defense-in-depth, not a sole PII control.
Concurrent cross-node writes to the same session are last-write-wins per field; sticky sessions are recommended for strict ordering.
Audit writes are synchronous (durability over throughput, by design).