industry · 2026-05-24 · 13 min read

Webhook retries never worked since launch: schema drift and a silently crashing cron (PR #529 batch II)

Mathieu (39) is chef-owner of "Maison Sainte-Croix", a 38-cover modern Belgian restaurant in Saint-Gilles, Brussels. He has been on thMenu for 16 months and for 8 months has run an outbound webhook to https://crm.maisonsaintecroix.be/webhook/thmenu that feeds a HubSpot pipeline. The subscription docs promise: "Retry on 5xx with exponential backoff up to 5 attempts."

On a Sunday in mid-May his sales report did not add up: 142 reservations in thMenu against 128 deals in HubSpot, which means **14 dropped events**. The Webhook Delivery Log showed 14 entries marked "failed", each with attempts: 1 rather than the promised 5. He emailed support.

Engineering's forensics ruled out two wrong theories before landing on the right one. (1) 5xx failures are never enrolled for retry. Ruled out: apps/web-admin/src/lib/webhooks/dispatch.ts was verified to INSERT a pending row. (2) The handleScheduled wiring is missing. Ruled out: cloudflare/src/index.ts was verified to call the webhook-retry job through runCronSafe at xx:00 and xx:30, and the Cloudflare Workers logs hold 8,640 webhook-retry entries for the last 6 months. **(3) The right theory: schema-vs-code drift.** Every one of those log entries carries the same error: `D1_ERROR: no such column: payload`. The cron crashes on every tick, and the try/catch in runCronSafe swallows the error.

Migration 0011, shipped at launch, defines payload_size_bytes INTEGER, attempt_number INTEGER, status_code INTEGER, delivered_at TEXT, error TEXT and status TEXT. The cron, unchanged since launch, SELECTs event_id, **payload**, **attempts**, **next_attempt_at**, status, **last_status_code** and **last_response_at**. That is five column-name mismatches, and next_attempt_at does not exist in the schema at all. The schema implementer worked from one mental model, the cron implementer from another, and code review missed the gap.

The result was **16 months of silent retry failure**. Every tick crashed and runCronSafe swallowed the error. console.error logging existed, but nobody watched the logs: there was no Sentry alert and no dashboard widget. `SELECT status, COUNT(*) FROM webhook_delivery_log GROUP BY status` returned 8,247 pending and 0 succeeded or dead.

**PR #529 batch II** ships a 3-layer fix. **Layer 1, schema align:** migration 0074 adds payload TEXT, attempts INTEGER, next_attempt_at TEXT, last_status_code INTEGER and last_response_at TEXT, backfills existing rows (attempts = attempt_number, and so on), and deprecates the old columns.
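The drift described above is mechanical to detect once the live schema and the columns the code expects are placed side by side. The following is a minimal sketch, not thMenu's actual check: `findMissingColumns` and the `col` helper are hypothetical names, the row shape follows what SQLite returns for `PRAGMA table_info`, and `event_id` is assumed to already exist in the table because the post does not list it among the migration 0011 columns.

```typescript
// Row shape returned by SQLite's `PRAGMA table_info(<table>)`.
interface TableInfoRow {
  cid: number;
  name: string;
  type: string;
  notnull: number;
  dflt_value: string | null;
  pk: number;
}

// Return every column the code expects that the live schema does not have.
function findMissingColumns(schema: TableInfoRow[], expected: string[]): string[] {
  const present = new Set(schema.map((row) => row.name));
  return expected.filter((column) => !present.has(column));
}

// Hypothetical helper that builds a PRAGMA-style row for this example.
const col = (cid: number, name: string, type: string): TableInfoRow => ({
  cid,
  name,
  type,
  notnull: 0,
  dflt_value: null,
  pk: 0,
});

// Columns of migration 0011 as listed in the post.
// Assumption: event_id already exists; the post does not say where it is defined.
const launchSchema: TableInfoRow[] = [
  col(0, "event_id", "TEXT"),
  col(1, "payload_size_bytes", "INTEGER"),
  col(2, "attempt_number", "INTEGER"),
  col(3, "status_code", "INTEGER"),
  col(4, "delivered_at", "TEXT"),
  col(5, "error", "TEXT"),
  col(6, "status", "TEXT"),
];

// Columns the retry cron SELECTs, as listed in the post.
const cronSelectColumns = [
  "event_id",
  "payload",
  "attempts",
  "next_attempt_at",
  "status",
  "last_status_code",
  "last_response_at",
];

const missing = findMissingColumns(launchSchema, cronSelectColumns);
console.log(missing);
// → ["payload", "attempts", "next_attempt_at", "last_status_code", "last_response_at"]
```

In a CI fixture the `launchSchema` array would come from a stored `PRAGMA table_info(webhook_delivery_log)` dump rather than being written by hand, and a non-empty result would fail the build.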
**Layer 2, atomic retry claim:** a race between concurrent cron ticks is avoided with `UPDATE ... RETURNING`, which SQLite offers from 3.35 and D1 supports: `UPDATE webhook_delivery_log SET status = 'in_progress', attempts = attempts + 1 WHERE event_id IN (SELECT ... WHERE status = 'pending' AND next_attempt_at < ? LIMIT 50) RETURNING event_id, payload, attempts, subscription_id, type`. Because the claim and the read happen in a single statement, two ticks cannot pick up the same event.

**Layer 3, observability:** the catch branch of runCronSafe now sends a Sentry beacon, and an alert rule on [BEACON:cron_failed] pages PagerDuty at a threshold of 5 or more failures per hour. The ops dashboard gains a cron success-rate widget, and the migration-drift-check cron (PR #333) compares a PRAGMA table_info dump against the column list of each critical table.

**Restore:** the first post-fix run picked up the 8,247 pending events. About 480 were recovered (still within the 5 min to 6 h retry window) and about 6,200 were already dead (older than 6 h). Engineering built a "lost notifications" ZIP archive and sent it by DM to the restaurant owners. Of Mathieu's 14 failed events, 2 were recovered and 12 were entered into HubSpot manually; he also received 6 months of free Pro tier and an apology. 23 restaurants publicly credited thMenu on Twitter Spaces.

Pattern: **in D1/PostgreSQL/MySQL, schema migrations and the code that depends on them (cron, handler, repository) are written in parallel by different developers, code review can miss the drift between them, and try/catch wrappers turn the resulting errors into silent failures. Mitigation: (1) a PRAGMA table_info() dump as a vitest fixture, diffed in CI; (2) a Sentry beacon, an alert rule and a success-rate widget; (3) an atomic UPDATE ... RETURNING claim that is race-safe.**

Implementation checklist: (1) a PRAGMA snapshot fixture per critical table; (2) a Sentry beacon [BEACON:cron_failed] in the runCronSafe catch branch; (3) a cron success-rate widget on the ops dashboard; (4) a critical column list for the migration-drift-check cron; (5) a pre-merge CI guard so that a migration and the repository code using it are reviewed together; (6) atomic UPDATE RETURNING, which D1 supports; (7) a quarterly schema-vs-code audit.

Sercan (Eski Kapi, Mardin) went through the same flow with a Pipedrive integration.
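Checklist item (2) can be sketched as a wrapper that still protects the scheduler from a crashing job but no longer hides the failure. This is an illustrative reimplementation, not thMenu's actual `runCronSafe`: the `FailureReporter` callback and the `CronResult` shape are hypothetical, and in production the reporter would forward to an error tracker such as Sentry rather than to an in-memory array.

```typescript
// Hypothetical reporter signature; a real one would forward to an error tracker.
type FailureReporter = (tag: string, job: string, error: unknown) => void;

// Hypothetical result shape, so callers and dashboards can count successes.
interface CronResult {
  job: string;
  ok: boolean;
  error?: string;
}

// Run a cron job without letting an exception escape to the scheduler,
// while logging and reporting every failure under a searchable tag.
async function runCronSafe(
  job: string,
  task: () => Promise<void>,
  report: FailureReporter,
): Promise<CronResult> {
  try {
    await task();
    return { job, ok: true };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    // The tag mirrors the alert rule named in the post.
    console.error(`[BEACON:cron_failed] ${job}: ${message}`);
    report("cron_failed", job, error);
    return { job, ok: false, error: message };
  }
}

// Example: collect reports in memory instead of sending them anywhere.
const reported: string[] = [];
const collectReport: FailureReporter = (tag, job) => {
  reported.push(`${tag}:${job}`);
};

// A task that fails the way the retry cron did on every tick.
const brokenRetryTick = async (): Promise<void> => {
  throw new Error("D1_ERROR: no such column: payload");
};

runCronSafe("webhook-retry", brokenRetryTick, collectReport).then((result) => {
  console.log(result.ok, result.error);
});
```

Returning a result object instead of `void` is a design choice made for this sketch: it gives a success-rate widget something to count without parsing logs.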

thMenu Team

thmenu.com
