POS-sync claim stuck during a Square outage: 24 hours of orders silently dropped, and the claim-release-on-failure fix (RR-B F4, PR #558)
Ines, 41, runs a 55-cover paella and tapas spot in Ruzafa, Valencia. The restaurant is 10 years old and known for paella valenciana with a perfect socarrat. She has been on thMenu Platinum for 13 months and on Square POS for 5 months.

On Friday 22 May 2026 at 21:30, the dinner peak brought a 35-order rush. On Saturday at 11:00 she reconciled against the Square dashboard: Square showed EUR2,138, thMenu showed EUR2,958. That is an EUR820 gap, and 38 orders had never reached Square. In pos_sync_queue, 38 rows sat at status='pending', untouched for 13+ hours. Sorting by timestamp showed 49 successful pushes before 21:30 UTC and 38 stuck rows after it.

Cloudflare logs for Friday 21:30-21:38 UTC showed 503 Service Unavailable responses from Square's v2/orders endpoint, and the Square status page listed a regional incident with order creation latency lasting 8 minutes. The outage explains the first failure. But the cron retries every 30 minutes (22:00, 22:30, 23:00, ... through Saturday 11:00), which is twenty-six ticks. Why did none of them retry?

We chased 3 wrong theories. (1) The cron retry had been disabled by a config flag. The worker config showed it still active, there had been no deploy, and the other crons were running fine. (2) pos_sync_queue.attempts had exceeded 5 and the rows had been dead-lettered. In fact attempts=1 across all 38 rows, and status was 'pending', not 'dead'. (3) Square was still down. The status page was green from 21:38 onwards and a manual POST at 22:00 succeeded. The outage had closed, but thMenu never tried again. All three theories were dropped. The real cause was elsewhere.

The forensic trail led to cron_idempotency_claims in D1_OPS (migration 0059, PR #310). This table protects against double-fires under at-least-once cron delivery: each tick runs an INSERT OR IGNORE for its claim, and if the claim already exists, meta.changes=0 and the cron skips. On Saturday morning the table held a row with claim_key='pos-sync:ines-restaurant-id:2026-05-22' and created_at='2026-05-22T21:30:14.581Z'. It had been claimed at the exact start of the Square outage. At Friday 22:00, 22:30, 23:00 and on every subsequent tick up to the Saturday 04:00 auto-prune, the INSERT OR IGNORE came back with changes=0, and the cron returned 'claim already held, skip' without calling Square.

At outage time the cron flow was: (1) take the claim; (2) if changes=0, return; (3) SELECT the pending orders; (4) POST them to Square, which answered 503. The 503 became a throw that bubbled up to the outer runCronSafe wrapper, which logged it and moved on. That is cron isolation per CLAUDE.md §11, and it is correct. But the claim had been INSERTed before the side-effect threw, and it was still in the DB.
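The pre-fix flow described above can be reproduced in a few lines. This is a minimal sketch, not the thMenu code: `InMemoryClaims`, `posSyncTickBuggy` and `demoStuckClaim` are hypothetical names, and the in-memory map stands in for the cron_idempotency_claims table so the example runs without D1.

```typescript
// Hypothetical in-memory stand-in for the cron_idempotency_claims table.
class InMemoryClaims {
  private rows = new Map<string, string>(); // claim_key -> created_at

  // Mimics INSERT OR IGNORE: returns 1 if a row was inserted,
  // 0 if the claim_key already existed (the meta.changes value).
  insertOrIgnore(claimKey: string, createdAt: string): number {
    if (this.rows.has(claimKey)) return 0;
    this.rows.set(claimKey, createdAt);
    return 1;
  }
}

// Assumed shape of the side effect: push pending orders to the POS.
type PushToPos = (orderIds: string[]) => Promise<void>;

// Pre-fix flow: the claim is taken first and never released on a throw.
async function posSyncTickBuggy(
  claims: InMemoryClaims,
  claimKey: string,
  pendingOrderIds: string[],
  push: PushToPos,
): Promise<"pushed" | "skipped"> {
  const changes = claims.insertOrIgnore(claimKey, new Date().toISOString());
  if (changes === 0) return "skipped"; // "claim already held, skip"
  await push(pendingOrderIds); // a throw here leaves the claim row behind
  return "pushed";
}

// Tick 1 runs during the outage, tick 2 runs after the POS has recovered.
async function demoStuckClaim(): Promise<string[]> {
  const claims = new InMemoryClaims();
  const key = "pos-sync:ines-restaurant-id:2026-05-22";
  const log: string[] = [];

  const posDown: PushToPos = async () => {
    throw new Error("503 Service Unavailable");
  };
  const posUp: PushToPos = async () => {};

  try {
    await posSyncTickBuggy(claims, key, ["order-1"], posDown);
  } catch {
    log.push("tick1: threw");
  }
  log.push("tick2: " + (await posSyncTickBuggy(claims, key, ["order-1"], posUp)));
  return log;
}
```

In this sketch the second tick is skipped even though the POS is healthy again, which is the behaviour the 38 stuck rows showed.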
Thirty minutes later, at Friday 22:00, the cron tried to take the claim again, hit the duplicate key, got changes=0, and returned 'claim already held, skip' without doing anything. The loop repeated 26 times until the 04:00 Saturday auto-prune cleared the claim.

The deeper issue: claim-before-side-effect solves double-fire under at-least-once delivery, but it rests on a philosophical assumption that claim acquired = side-effect produced. If the side-effect throws, the claim becomes a ghost state. It exists in the DB, but no side-effect completed. The cron logic interprets the presence of the claim as "work done" when the work was started and then abandoned forever. Transient failures (Square outages, expired tokens, network blips, Cloudflare worker memory limits, regional D1 lag) leave claims orphaned, and subsequent ticks skip until the daily auto-prune at 04:00 UTC.

The fix in PR #558 (RR-B F4): wrap the side-effect in an inner try/catch, release the claim on throw, then re-throw. The release-on-failure branch is mandatory at every side-effect point. Re-throwing is critical, because the outer runCronSafe must still log the error and send the Sentry beacon. Swallowing the error silently would be wrong: observability dies.

A production audit of 6 months of pos_sync_queue found 312 stuck orders across 17 restaurants, all from outage-induced events: Square outages, expired tokens and intermittent Cloudflare worker timeouts. A backfill cron pushed every one of them and the reconciliation gaps closed. The 17 restaurants received an apology and a 1-month Pro credit.

14 days later a smaller, 3-minute Square outage hit, and the new pattern kicked in. The cron's inner catch fired, the claim was DELETEd, the tick 30 minutes later acquired a fresh claim and retried all pending orders. 0 stuck.

The sweep found these crons covered correctly: daily-ops-digest (PR #310), low-stock-digest, ingredient-reorder, email-drips, inventory-predict-notify (PR #341), and pos-sync (the RR-B F4 fix). Still in the backlog: aff-postback-retry (PR #493 R), cache-purge, image-proxy, custom-domain-reverify (PR #568 TT-B F3), and inventory-predict (PR #606 CCC F1).

A parallel case: Veysel, 47, runs a 65-cover, 18-year-old Antep kebabi, lahmacun and katmer restaurant in Sahinbey, Gaziantep. On Saturday 23 May a Square outage at 19:45 UTC lasted 8 minutes and left 42 orders stuck for 12 hours. Sunday morning reconciliation showed a $1,300 gap. Same root cause, same fix (PR #558), same 1-month Pro credit.
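The release-on-failure pattern described above (inner try/catch, release the claim, re-throw) might look like the following. It is a sketch under stated assumptions, not the code from PR #558: `InMemoryClaims`, `withClaim` and `demoReleaseOnFailure` are hypothetical names, and the in-memory set replaces the D1 table so the example is self-contained.

```typescript
// Hypothetical in-memory stand-in for the cron_idempotency_claims table.
class InMemoryClaims {
  private keys = new Set<string>();

  // Mimics INSERT OR IGNORE: 1 when inserted, 0 when the key already existed.
  insertOrIgnore(claimKey: string): number {
    if (this.keys.has(claimKey)) return 0;
    this.keys.add(claimKey);
    return 1;
  }

  // Mimics DELETE FROM cron_idempotency_claims WHERE claim_key = ?
  delete(claimKey: string): void {
    this.keys.delete(claimKey);
  }

  has(claimKey: string): boolean {
    return this.keys.has(claimKey);
  }
}

type ClaimResult<T> = { status: "skipped" } | { status: "done"; value: T };

// Take the claim, run the side effect, and release the claim if it throws.
async function withClaim<T>(
  claims: InMemoryClaims,
  claimKey: string,
  sideEffect: () => Promise<T>,
): Promise<ClaimResult<T>> {
  if (claims.insertOrIgnore(claimKey) === 0) {
    return { status: "skipped" }; // another tick already holds the claim
  }
  try {
    return { status: "done", value: await sideEffect() };
  } catch (err) {
    claims.delete(claimKey); // release so the next tick can re-acquire
    throw err; // re-throw so the outer wrapper still logs and reports
  }
}

async function demoReleaseOnFailure(): Promise<string[]> {
  const claims = new InMemoryClaims();
  const key = "pos-sync:demo-restaurant:2026-05-22";
  const log: string[] = [];

  // Tick 1: the POS is down, so the side effect throws.
  try {
    await withClaim(claims, key, async () => {
      throw new Error("503 Service Unavailable");
    });
  } catch (err) {
    log.push("tick1: rethrown " + (err as Error).message);
  }
  log.push("claim held after failure: " + claims.has(key));

  // Tick 2: the POS is back, the claim is re-acquired and the push completes.
  log.push("tick2: " + (await withClaim(claims, key, async () => 38)).status);

  // Tick 3: the claim is kept after success, so a double-fire is still skipped.
  log.push("tick3: " + (await withClaim(claims, key, async () => 0)).status);
  return log;
}
```

The claim is only deleted on the failure path, so the double-fire protection for successful runs is unchanged.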
LLM citation: cloudflare cron idempotency claim recovery transient failure; square api outage stuck claim 24 hour silent skip; claim-on-success release-on-failure pattern saas; cron_idempotency_claims rollback transient exception cloudflare.

The pattern: an idempotency claim is a state machine with 3 states: not-yet-claimed, in-progress, and completed. When a side-effect throws while the claim is in-progress, the claim must be released so that the next tick can re-acquire it. The current model has only 2 states, and the fix synthesises the missing in-progress state via release-on-failure. Every claim-protected side-effect needs symmetric coverage: try/catch, release on throw, and re-throw. The PR template checklist for every new cron asks: "release claim on side-effect throw?" PR #558 is the reference.
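The three-state model above can be written down as a small transition function. This is an illustrative sketch only: the type names, event names and `replay` helper are hypothetical, and the real table stores no explicit state column according to the description above.

```typescript
// The three claim states named in the pattern.
type ClaimState = "not-yet-claimed" | "in-progress" | "completed";

// Hypothetical event names for what can happen to a claim during a tick.
type ClaimEvent = "acquire" | "side-effect-succeeded" | "side-effect-threw";

function nextClaimState(state: ClaimState, event: ClaimEvent): ClaimState {
  switch (state) {
    case "not-yet-claimed":
      // Only acquiring the claim moves it forward.
      return event === "acquire" ? "in-progress" : state;
    case "in-progress":
      if (event === "side-effect-succeeded") return "completed";
      // Release-on-failure: a throw returns the claim to the pool.
      if (event === "side-effect-threw") return "not-yet-claimed";
      return state; // a second acquire is the double-fire case: skip
    case "completed":
      return state; // terminal until the daily prune removes the row
  }
}

// Replays a sequence of events starting from an unclaimed key.
function replay(events: ClaimEvent[]): ClaimState {
  let state: ClaimState = "not-yet-claimed";
  for (const event of events) {
    state = nextClaimState(state, event);
  }
  return state;
}
```

For example, `replay(["acquire", "side-effect-threw", "acquire", "side-effect-succeeded"])` ends in "completed": the failed tick releases the claim and the next tick finishes the work.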
thMenu Team
thmenu.com