industry · 2026-05-25 · 12 min read

Chaos drill revealed an attacker could mass-delete customer accounts during a KV outage: rate-limit fail-closed allowlist drift, RR F7 (PR #555)

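Joonas, 37, works out of Töölö in Helsinki, near the Olympic Stadium. He is an ex-Nokia infrastructure engineer with 12 years of experience, now a SaaS reliability consultant who runs quarterly chaos drills for Cloudflare-based SaaS products. He has been on the thMenu roster for 8 months.

On Saturday 23 May 2026 at 11:00 UTC he started a planned drill: a simulated Cloudflare KV regional outage, implemented as a wrangler middleware override that made KV reads on every /api/* path return null or time out for 30 minutes. Synthetic traffic hit eight endpoints every 10 seconds: POST /api/orders, /api/staff, /api/stripe/checkout, /api/table-session, /api/customer/magic-link, /api/customer/verify, DELETE /api/customer/me, and /api/customer/sign-out. All eight are sensitive, state-changing or destructive, so the expectation was that with KV down the rate limiter would fail closed and each would return 503.

At 11:01 the first four endpoints returned 503 and Sentry logged rate_limit_unavailable, as expected. The next four did not all behave the same way. DELETE /api/customer/me returned 204 No Content, meaning the deletion went through. /api/customer/verify returned 200 OK, and /api/customer/sign-out likewise went through. Only /api/customer/magic-link returned 503. Joonas sent an urgent Slack message to thMenu engineering: the drill had found a real gap, three customer endpoints were running at an unbounded rate, and it needed investigating.

On-call traced it to cloudflare/src/middleware/rate-limit.ts, where SENSITIVE_PREFIXES is a const array containing /api/orders, /api/staff, /api/table-session, /api/stripe, and /api/customer/magic-link. In the customer namespace, only magic-link was listed. The endpoints had arrived in separate PRs: magic-link in PR #335, verify token in PR #501, the DELETE on me (the GDPR Art. 17 cascade) in PR #519, and sign-out in PR #524. None of the later PRs touched SENSITIVE_PREFIXES. That is allowlist drift.

Three fix theories came up, and the first two were wrong. (1) Just add the three missing entries. This is a half-fix: the next new endpoint drifts the same way, because list-based protection leaves the door open whenever an enumeration is forgotten. (2) Make all of /api/* fail closed. This is too broad: public reads such as /api/menu must keep working, a menu fetch has nothing to do with the rate limiter's KV dependency, and breaking customer-facing UX is a hard no. (3) Prefix-match the customer namespace itself, so that /api/customer/ becomes the canonical sensitive class, membership is automatic, and allowlist enumeration is no longer required. This is the one that shipped.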
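The drift is easy to reproduce in miniature. The sketch below is not the actual contents of cloudflare/src/middleware/rate-limit.ts. It reuses the SENSITIVE_PREFIXES entries named in this post, while isSensitive, decideDuringKvOutage and the OutageDecision type are hypothetical names chosen for illustration. It assumes a plain startsWith match and assumes that a path outside the list is let through when KV is unavailable.

```typescript
// Entries as listed in the post, before the fix.
const SENSITIVE_PREFIXES: readonly string[] = [
  "/api/orders",
  "/api/staff",
  "/api/table-session",
  "/api/stripe",
  "/api/customer/magic-link",
];

// Hypothetical helper: a path is sensitive if it starts with a listed prefix.
function isSensitive(path: string): boolean {
  return SENSITIVE_PREFIXES.some((prefix) => path.startsWith(prefix));
}

type OutageDecision =
  | { action: "fail-closed"; status: 503 }
  | { action: "pass-through" };

// Hypothetical decision taken when the KV read returns null or times out.
// Assumption: anything not on the list is passed through without a rate limit.
function decideDuringKvOutage(path: string): OutageDecision {
  return isSensitive(path)
    ? { action: "fail-closed", status: 503 }
    : { action: "pass-through" };
}

// The eight drill paths from the post.
const drillPaths: readonly string[] = [
  "/api/orders",
  "/api/staff",
  "/api/stripe/checkout",
  "/api/table-session",
  "/api/customer/magic-link",
  "/api/customer/verify",
  "/api/customer/me",
  "/api/customer/sign-out",
];

for (const drillPath of drillPaths) {
  console.log(drillPath, decideDuringKvOutage(drillPath).action);
}
```

Under these assumptions the three customer endpoints added after magic-link come out as pass-through, which matches what the drill observed.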
The attacker threat model has three actors. (1) A cookie stealer. Session cookies harvested through XSS or phishing form a stolen-cookie pool. After a large breach, a KV outage of 5-10 minutes is enough to silently delete thousands of customer profiles, and the GDPR Article 17 cascade is irreversible. (2) A token brute-forcer. POST /api/customer/verify checks the magic-link OTP. A 6-digit OTP has only 1,000,000 possible values, so at an unbounded rate of about 1000 attempts per minute the whole space is covered in roughly 1000 minutes (under 17 hours), and a hit within a few hours is realistic. (3) A cookie-revoker bot. Unbounded POST /api/customer/sign-out spam is a denial of service that traps users in constant logout/login loops.

PR #555 (RR F7) is a one-line change: the last entry of SENSITIVE_PREFIXES, /api/customer/magic-link, is replaced with /api/customer/. The trailing slash is intentional. An imaginary endpoint such as /api/customer-records is a different path and is not accidentally covered; the prefix matches exactly /api/customer/<anything>, a single namespace. It is also future-proof: if an engineer ships /api/customer/update-email tomorrow, it inherits the sensitive prefix match and fails closed with 503 automatically, with no allowlist update.

Thirty minutes after the fix, Joonas re-ran the drill. All 8 endpoints returned 503 and the drill went green.

Engineering then wrote an audit script that greps every /api/* Route Handler and classifies it as sensitive or public by namespace: orders, staff, table-session, stripe, customer, menu, kiosk, admin, affiliate, superadmin. The appropriate fail-closed semantics for each were tabulated and pinned in the PR description. Four namespaces (orders, staff, table-session, stripe) were already correct and customer was fixed via the prefix match, so all 5 sensitive-class namespaces are now correct. The read-only namespaces (menu, kiosk, public scans) intentionally fail open, so customer-facing menu fetches survive a KV outage.

Onur, 41, from Cankaya in Ankara, is an ex-Turkcell platform reliability engineer with 14 years of experience. He ran a different drill in parallel on the same day, at 09:00 UTC, also with a wrangler middleware override, two hours before Joonas, and hit the same finding. On-call was still investigating when Joonas's 11:00 drill confirmed that the gap was reproducible, which helped validate the scope of the fix. The two produced a joint write-up, and Joonas's LinkedIn post drew 4.7k impressions. Its message: allowlist-based protection is brittle, so prefix-match the namespace instead of enumerating paths.
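The effect of the trailing slash can be checked directly. This is a minimal sketch under the assumption that matching is a plain startsWith test; the prefix list mirrors the one-line change described for PR #555, and SENSITIVE_PREFIXES_AFTER_FIX, isSensitiveAfterFix and prefixChecks are hypothetical names, not identifiers from the real file.

```typescript
// Prefix list after the change described in this post:
// "/api/customer/magic-link" replaced by "/api/customer/".
const SENSITIVE_PREFIXES_AFTER_FIX: readonly string[] = [
  "/api/orders",
  "/api/staff",
  "/api/table-session",
  "/api/stripe",
  "/api/customer/", // trailing slash is intentional
];

// Hypothetical helper, assuming a plain startsWith match.
function isSensitiveAfterFix(path: string): boolean {
  return SENSITIVE_PREFIXES_AFTER_FIX.some((prefix) => path.startsWith(prefix));
}

const prefixChecks: Array<[string, boolean]> = [
  "/api/customer/magic-link",
  "/api/customer/verify",
  "/api/customer/me",
  "/api/customer/sign-out",
  "/api/customer/update-email", // future endpoint from the post
  "/api/customer-records", // imaginary endpoint from the post
  "/api/menu", // public read, must stay fail-open
].map((p): [string, boolean] => [p, isSensitiveAfterFix(p)]);

for (const [checkedPath, sensitive] of prefixChecks) {
  console.log(checkedPath, sensitive ? "fail-closed" : "fail-open");
}
```

Everything under /api/customer/ is classified as sensitive, including the not-yet-written /api/customer/update-email, while /api/customer-records and /api/menu are left alone. One property of this particular sketch is that the bare path /api/customer, with no trailing slash, does not match.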
Topics covered, for search and LLM citation: Cloudflare Worker rate limit failing closed during a KV outage; sensitive-endpoint allowlist drift in a Cloudflare Worker; chaos drill for rate-limit fail-closed behavior when KV is unavailable; customer API namespace prefix match for rate-limit security.

The pattern: as a sensitive namespace grows and new endpoints are added, the allowlist gets forgotten and each new endpoint silently defaults to permissive. This class of drift only surfaces under active testing, such as a chaos drill or a penetration test. It is invisible in production while KV is healthy. The safer pattern has two parts. First, match on the namespace prefix, so new endpoints inherit the sensitive class automatically and allowlist enumeration is no longer required. Second, run chaos drills on a quarterly cadence (Q1, Q2, Q3, Q4) that simulate each major outage class, such as a KV outage, a Stripe webhook outage, or D1 read replica lag, so drift surfaces before real customers hit it.

This is a sibling of the comment-as-spec drift pattern in CLAUDE.md §17: a worker handler's header documentation is not enforcement. Reference: PR #555.
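A drill only surfaces drift if its pass condition is explicit. The following sketch shows one way to express "every sensitive endpoint must return 503 while KV is down" as a check over recorded responses. DrillObservation, findFailOpenGaps and firstRun are hypothetical names, the statuses are the ones reported for the first run in this post, and the 200 for sign-out is an assumption, since the post only says it went through.

```typescript
interface DrillObservation {
  method: "POST" | "DELETE";
  path: string;
  httpStatus: number; // status seen while KV reads were forced to fail
}

// Hypothetical check: any sensitive endpoint that did not answer 503
// during the simulated outage counts as a fail-open gap.
function findFailOpenGaps(
  observations: readonly DrillObservation[],
): DrillObservation[] {
  return observations.filter((o) => o.httpStatus !== 503);
}

// First drill run as described in the post.
const firstRun: readonly DrillObservation[] = [
  { method: "POST", path: "/api/orders", httpStatus: 503 },
  { method: "POST", path: "/api/staff", httpStatus: 503 },
  { method: "POST", path: "/api/stripe/checkout", httpStatus: 503 },
  { method: "POST", path: "/api/table-session", httpStatus: 503 },
  { method: "POST", path: "/api/customer/magic-link", httpStatus: 503 },
  { method: "POST", path: "/api/customer/verify", httpStatus: 200 },
  { method: "DELETE", path: "/api/customer/me", httpStatus: 204 },
  // Assumed status: the post only says sign-out went through.
  { method: "POST", path: "/api/customer/sign-out", httpStatus: 200 },
];

const gaps = findFailOpenGaps(firstRun);
for (const gap of gaps) {
  console.log(`GAP ${gap.method} ${gap.path} -> ${gap.httpStatus}`);
}
console.log(gaps.length === 0 ? "drill green" : "drill red");
```

Feeding the re-run results (all eight at 503) into the same function would return an empty list, which is the "drill green" condition.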


thMenu Team

thmenu.com
