Webhook retry mechanism never worked since launch: schema drift and a silently crashing cron (PR #529 II)
Mathieu (39) is the chef-owner of Maison Sainte-Croix, a 38-cover modern Belgian restaurant in Saint-Gilles, Brussels. He has been on thMenu for 16 months, and for the last 8 months an outbound webhook to https://crm.maisonsaintecroix.be/webhook/thmenu has fed his HubSpot pipeline. The subscription docs promise: "Retry on 5xx with exponential backoff up to 5 attempts."

On a Sunday in mid-May his sales report did not add up: 142 reservations in thMenu against 128 deals in HubSpot, so **14 events had been dropped**. The Webhook Delivery Log showed 14 entries marked "failed" with attempts: 1, not the promised 5. He emailed support.

Engineering's forensics ruled out two theories first. (1) 5xx failures were never enrolled for retry: disproved, because `apps/web-admin/src/lib/webhooks/dispatch.ts` does INSERT a pending row. (2) The `handleScheduled` wiring was missing: disproved, because `cloudflare/src/index.ts` does call webhook-retry through `runCronSafe` at xx:00 and xx:30, and the Cloudflare Workers logs held 8,640 webhook-retry entries for the last 6 months, which matches 48 ticks a day over roughly 180 days.

**(3) The right theory: schema-vs-code drift.** Every one of those log entries carried the same payload: `D1_ERROR: no such column: payload`. The cron crashed on every tick, and the try/catch in `runCronSafe` swallowed the error.

Migration 0011, shipped at launch, created these columns: payload_size_bytes INTEGER, attempt_number INTEGER, status_code INTEGER, delivered_at TEXT, error TEXT, status TEXT. The cron, also unchanged since launch, ran a SELECT on event_id, **payload**, **attempts**, **next_attempt_at**, status, **last_status_code**, **last_response_at**. The five bolded names do not match any column in the schema, and next_attempt_at has no counterpart at all. The schema came from implementer A's mental model, the cron from implementer B's, and code review missed the gap.

The result was **16 months of silent retry failure**. The cron crashed on every tick and `runCronSafe` swallowed it. `console.error` logging existed, but nobody watched the logs, and there was no Sentry alert and no dashboard widget. `SELECT status, COUNT(*) FROM webhook_delivery_log GROUP BY status` returned 8,247 pending rows and 0 succeeded or dead.

**PR #529 batch II** ships a three-layer fix. **Layer 1, schema align**: migration 0074 adds payload TEXT, attempts INTEGER, next_attempt_at TEXT, last_status_code INTEGER and last_response_at TEXT, backfills existing rows (attempts = attempt_number, etc.), and deprecates the old columns.
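The drift that Layer 1 closes can be reproduced as a small check. The sketch below is illustrative only and is not the code from PR #529: `findMissingColumns`, `column`, `launchSchema` and `cronSelectColumns` are hypothetical names, the fixture rows imitate the shape of SQLite's `PRAGMA table_info` output, and the presence and type of `event_id` in the launch schema are assumptions, since the migration 0011 column list above does not mention it.

```typescript
// Shape of one row returned by SQLite's PRAGMA table_info(<table>).
interface TableInfoRow {
  cid: number;
  name: string;
  type: string;
  notnull: number;
  dflt_value: string | null;
  pk: number;
}

// Returns the columns the code expects that the live table does not have.
function findMissingColumns(
  tableInfo: TableInfoRow[],
  requiredColumns: string[],
): string[] {
  const actual = new Set(tableInfo.map((row) => row.name));
  return requiredColumns.filter((name) => !actual.has(name));
}

// Hypothetical helper for building fixture rows.
function column(cid: number, name: string, type: string): TableInfoRow {
  return { cid, name, type, notnull: 0, dflt_value: null, pk: 0 };
}

// Launch schema as described for migration 0011.
// ASSUMPTION: event_id exists as TEXT; column order is illustrative.
const launchSchema: TableInfoRow[] = [
  column(0, "event_id", "TEXT"),
  column(1, "payload_size_bytes", "INTEGER"),
  column(2, "attempt_number", "INTEGER"),
  column(3, "status_code", "INTEGER"),
  column(4, "delivered_at", "TEXT"),
  column(5, "error", "TEXT"),
  column(6, "status", "TEXT"),
];

// Columns the retry cron has selected since launch.
const cronSelectColumns = [
  "event_id",
  "payload",
  "attempts",
  "next_attempt_at",
  "status",
  "last_status_code",
  "last_response_at",
];

const missing = findMissingColumns(launchSchema, cronSelectColumns);
console.log(missing);
// → ["payload", "attempts", "next_attempt_at", "last_status_code", "last_response_at"]
```

Run in CI against a snapshot of the real `PRAGMA table_info` output, a non-empty result would fail the build before a query like the cron's SELECT ever reached production.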
**Layer 2, atomic retry claim**: the race between concurrent cron ticks is avoided with `UPDATE ... RETURNING`, which SQLite has offered since 3.35 and D1 supports: `UPDATE webhook_delivery_log SET status = 'in_progress', attempts = attempts + 1 WHERE event_id IN (SELECT ... WHERE status = 'pending' AND next_attempt_at < ? LIMIT 50) RETURNING event_id, payload, attempts, subscription_id, type`. Rows are claimed and read in a single statement, so two ticks cannot pick up the same event.

**Layer 3, observability**: the catch branch of `runCronSafe` now sends a Sentry beacon, and an alert rule on [BEACON:cron_failed] pages PagerDuty at a threshold of 5 or more per hour. The ops dashboard gets a cron success-rate widget. A migration-drift-check cron (PR #333) compares a `PRAGMA table_info` dump against the column list of each critical table.

**Restore**: the first post-fix run picked up the 8,247 pending events. About 480 were recovered because they were still inside the retry window (5 min to 6 h); about 6,200 were already dead (older than 6 h). Engineering built a "lost notifications" ZIP archive and DM'd it to the restaurant owners. Of Mathieu's 14 failed events, 2 were recovered and 12 were entered into HubSpot manually; he also received 6 months of the Pro tier for free and a sympathetic apology. 23 restaurants publicly credited thMenu on Twitter Spaces.

Pattern: **in D1/PostgreSQL/MySQL, schema migrations and code (cron/handler/repository) are written in parallel by different developers, code review can miss schema drift, and try/catch wrappers create silent-failure modes. Mitigation: (1) a `PRAGMA table_info()` dump as a vitest fixture, diffed in CI; (2) a Sentry beacon, an alert rule and a success-rate widget; (3) an atomic, race-safe `UPDATE ... RETURNING` claim.**

Implementation checklist: (1) a PRAGMA snapshot fixture per critical table; (2) a Sentry beacon [BEACON:cron_failed] in the `runCronSafe` catch branch; (3) a cron success-rate widget on the ops dashboard; (4) a migration-drift-check cron with a critical column list; (5) a pre-merge CI guard so that a migration and the repository code that depends on it are reviewed together; (6) atomic `UPDATE ... RETURNING` claims, which D1 supports; (7) a quarterly schema-vs-code audit.

Sercan (Eski Kapi, Mardin) runs the same flow with Pipedrive.
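Layers 2 and 3 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code that shipped: `DatabaseLike` and `StatementLike` are hypothetical local interfaces that imitate the `prepare().bind().all()` chain of a D1 binding so the example compiles on its own, the subquery's select list (elided as `...` in the statement above) is filled in with a guess, and `Reporter` is a hypothetical stand-in for whatever forwards the beacon to Sentry.

```typescript
// Hypothetical structural types; they imitate a D1 binding's call chain.
interface StatementLike {
  bind(...values: unknown[]): StatementLike;
  all<T>(): Promise<{ results: T[] }>;
}
interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

interface ClaimedEvent {
  event_id: string;
  payload: string;
  attempts: number;
  subscription_id: string;
  type: string;
}

// One statement both marks due rows in_progress and returns them, so two
// overlapping ticks cannot claim the same event.
// ASSUMPTION: the inner SELECT list is a guess; the article elides it.
const CLAIM_SQL = `
  UPDATE webhook_delivery_log
  SET status = 'in_progress', attempts = attempts + 1
  WHERE event_id IN (
    SELECT event_id FROM webhook_delivery_log
    WHERE status = 'pending' AND next_attempt_at < ?
    LIMIT 50
  )
  RETURNING event_id, payload, attempts, subscription_id, type`;

async function claimDueEvents(
  db: DatabaseLike,
  nowIso: string,
): Promise<ClaimedEvent[]> {
  const { results } = await db.prepare(CLAIM_SQL).bind(nowIso).all<ClaimedEvent>();
  return results;
}

// Hypothetical reporter; in production it would forward the beacon to Sentry.
type Reporter = (tag: string, jobName: string, error: unknown) => void;

// Keeps a failing job from crashing the worker, but reports the failure
// instead of only swallowing it. Returns false when the job threw.
async function runCronSafe(
  jobName: string,
  job: () => Promise<void>,
  report: Reporter,
): Promise<boolean> {
  try {
    await job();
    return true;
  } catch (error) {
    console.error(`[BEACON:cron_failed] ${jobName}`, error);
    report("cron_failed", jobName, error);
    return false;
  }
}
```

Returning a boolean from `runCronSafe` is a design choice made for this sketch: it lets a caller count successes and failures for a success-rate widget without parsing logs.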
thMenu Team
thmenu.com