industry · 2026-08-26 · 7 min read

Voice Ordering Right in the Browser via the Web Speech API for Cafes

Author: thMenu

A customer at a Turkish seaside cafe, holding a giant ice-cream cone, says "order me a latte" and the browser does it. A look inside the Web Speech API + Cloudflare AI stack.

thMenu Team

thmenu.com

Picture a beachside cafe in Akcaabat, Turkey. The customer is juggling a melting double-scoop ice-cream cone in one hand and a beach towel in the other. Tapping a phone screen is all but impossible, but saying "order me one ayran and one bread" is trivial. Browser-based voice ordering bridges that gap with zero app install: just a QR code and a microphone permission.

How In-Browser ASR Works

The SpeechRecognition interface built into Chrome and Safari (often exposed under the prefixed name webkitSpeechRecognition) supports Turkish, English, Arabic and 60+ other locales without any app download. After the user grants microphone permission, "Order me one latte and two cookies" comes back as raw text in roughly 1.2 seconds. A pulsing waveform animation while listening and a friendly "Could you try again?" when noise drowns out the speech both build trust.
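
The flow above can be sketched roughly as follows. This is a minimal illustration, not thMenu's actual code: the helper names getRecognizerCtor and listenOnce are hypothetical, and the small interfaces only mirror the parts of the SpeechRecognition API the sketch touches (lang, interimResults, maxAlternatives, onresult, onerror, start), so it compiles without DOM typings.

```typescript
// Minimal structural types mirroring the parts of SpeechRecognition used here.
interface RecognitionAlternative { transcript: string; confidence: number }
interface RecognitionEvent { results: ArrayLike<ArrayLike<RecognitionAlternative>> }
interface Recognizer {
  lang: string;
  interimResults: boolean;
  maxAlternatives: number;
  onresult: ((e: RecognitionEvent) => void) | null;
  onerror: ((e: { error: string }) => void) | null;
  start(): void;
}
type RecognizerCtor = new () => Recognizer;

// Hypothetical helper: find the constructor under its standard or prefixed name.
function getRecognizerCtor(scope: Record<string, unknown>): RecognizerCtor | null {
  const ctor = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof ctor === "function" ? (ctor as RecognizerCtor) : null;
}

// Hypothetical helper: listen for one utterance and hand back the raw text.
// Returns false when the browser has no recognizer, so the caller can fall back.
function listenOnce(
  scope: Record<string, unknown>,
  lang: string,
  onText: (text: string) => void,
  onRetry: () => void,
): boolean {
  const Ctor = getRecognizerCtor(scope);
  if (!Ctor) return false;
  const rec = new Ctor();
  rec.lang = lang;              // e.g. "tr-TR" or "en-US"
  rec.interimResults = false;   // only deliver the final transcript
  rec.maxAlternatives = 1;
  rec.onresult = (e) => {
    const text = e.results[0]?.[0]?.transcript;
    if (text && text.trim() !== "") onText(text.trim());
    else onRetry();
  };
  rec.onerror = () => onRetry(); // e.g. show "Could you try again?"
  rec.start();
  return true;
}
```

In a page, scope would be the global object (for example `window as unknown as Record<string, unknown>`), and onRetry would show the "Could you try again?" prompt.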

Raw text alone is not enough. To turn "one latte", "a latte" or even "uno latte" into a structured cart, we route the transcript to Llama 3.1 8B on Cloudflare Workers AI and constrain the output with a strict JSON schema. The response is just { items: [{ product, qty }] }. Median latency hovers near 800 ms and drops to about 50 ms when a repeated phrase is served from the KV cache.
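
Because model output can still drift from the schema, it is worth validating the JSON before it touches the cart. The sketch below is an assumption about how that might look: parseCart and cacheKey are hypothetical names, the "nlu:" key prefix is invented for illustration, and the model call itself is left out.

```typescript
interface CartItem { product: string; qty: number }
interface Cart { items: CartItem[] }

// Hypothetical validator for the { items: [{ product, qty }] } shape.
// Returns null for anything that does not match, so the UI can ask again.
function parseCart(raw: string): Cart | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const items = (data as { items?: unknown }).items;
  if (!Array.isArray(items)) return null;

  const out: CartItem[] = [];
  for (const it of items) {
    if (typeof it !== "object" || it === null) return null;
    const { product, qty } = it as { product?: unknown; qty?: unknown };
    if (typeof product !== "string" || product.trim() === "") return null;
    if (typeof qty !== "number" || !Number.isFinite(qty) || qty <= 0) return null;
    out.push({ product: product.trim(), qty });
  }
  return { items: out };
}

// Hypothetical cache key: lower-case and collapse whitespace so that
// repeated phrases map to the same KV entry.
function cacheKey(transcript: string): string {
  return "nlu:" + transcript.toLowerCase().replace(/\s+/g, " ").trim();
}
```

Note that plain toLowerCase is not locale-aware; Turkish dotted and dotless "i" would need locale-specific handling in a real deployment.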

Disambiguation Edge Cases

"Ayran" the salty yogurt drink versus "ayran corbasi" the soup is a classic ambiguity. When two SKUs share a stem, the NLU emits a follow-up: it speaks back "Did you mean the drink or the soup?" and shows two product cards. One tap or one word finishes the resolution — no typing.

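One way to detect a shared stem and build the follow-up question is sketched below. The Sku shape, the kind labels and the resolveProduct helper are all hypothetical; the real catalog model is not described in this article.

```typescript
interface Sku { id: string; name: string; kind: string }

type Resolution =
  | { status: "resolved"; sku: Sku }
  | { status: "ambiguous"; prompt: string; options: Sku[] }
  | { status: "unknown" };

// Hypothetical resolver: a SKU is a candidate when its name equals the
// spoken stem or starts with the stem followed by a space.
function resolveProduct(spoken: string, menu: Sku[]): Resolution {
  const stem = spoken.trim().toLowerCase();
  if (stem === "") return { status: "unknown" };

  const options = menu.filter((sku) => {
    const name = sku.name.toLowerCase();
    return name === stem || name.startsWith(stem + " ");
  });

  if (options.length === 0) return { status: "unknown" };
  const [only] = options;
  if (options.length === 1 && only !== undefined) {
    return { status: "resolved", sku: only };
  }
  // Two or more SKUs share the stem: ask instead of guessing.
  const kinds = options.map((sku) => sku.kind);
  return {
    status: "ambiguous",
    prompt: `Did you mean the ${kinds.join(" or the ")}?`,
    options, // rendered as product cards next to the spoken prompt
  };
}
```

With a menu that contains both "ayran" (kind "drink") and "ayran corbasi" (kind "soup"), the word "ayran" alone produces the follow-up question, while the full phrase "ayran corbasi" resolves directly to the soup.
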
Dialect: Black Sea pronunciation "kahave" matches "kahve" with a 0.85 fuzzy threshold
Allergens: "no peanuts" parses as a negative slot, not a product
Quantity: "half portion" normalizes to 0.5 in the cart
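
The three rules above can be sketched as follows. The article does not say which similarity metric the 0.85 threshold applies to, so this sketch assumes a ratio based on the longest common subsequence, 2 x LCS / (combined length), under which "kahave" and "kahve" score 10/11. Under plain edit distance normalized by the longer word the same pair would score 5/6 and miss the threshold, so the metric choice matters. All function names are hypothetical.

```typescript
// Length of the longest common subsequence, using a single DP row.
function lcsLength(a: string, b: string): number {
  const row = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    let diag = 0; // dp[i-1][j-1]
    for (let j = 1; j <= b.length; j++) {
      const up = row[j] ?? 0; // dp[i-1][j]
      row[j] = a[i - 1] === b[j - 1] ? diag + 1 : Math.max(up, row[j - 1] ?? 0);
      diag = up;
    }
  }
  return row[b.length] ?? 0;
}

// Assumed metric: 2 * LCS / (|a| + |b|), in the range 0..1.
function similarity(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : (2 * lcsLength(a, b)) / total;
}

// Best menu name at or above the threshold, or null if none qualifies.
function bestMatch(spoken: string, names: string[], threshold = 0.85): string | null {
  let best: string | null = null;
  let bestScore = threshold;
  for (const name of names) {
    const score = similarity(spoken.toLowerCase(), name.toLowerCase());
    if (score >= bestScore) { best = name; bestScore = score; }
  }
  return best;
}

type Slot =
  | { kind: "exclude"; allergen: string }
  | { kind: "quantity"; value: number }
  | { kind: "product"; name: string };

// Hypothetical slot parser: negatives and quantities are checked before
// product matching, so "no peanuts" never becomes a product.
function parseSlot(phrase: string, menu: string[]): Slot | null {
  const text = phrase.trim().toLowerCase();
  if (text.startsWith("no ")) return { kind: "exclude", allergen: text.slice(3).trim() };
  if (text === "half portion") return { kind: "quantity", value: 0.5 };
  const name = bestMatch(text, menu);
  return name === null ? null : { kind: "product", name };
}
```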

Accessibility and Fallbacks

The real winners are guests with mobility issues or low vision. The transcript renders inside an aria-live="polite" region and the assistant can read back the cart via TTS. A blind diner can complete an entire order conversation without ever seeing the screen.
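
A rough sketch of the read-back step is shown below. The names cartSummary and announce are hypothetical, and the LiveRegion and Speaker interfaces are stand-ins so the example runs outside a browser: in a page, the region would be the aria-live="polite" element and the speaker would wrap the browser's speech synthesis.

```typescript
interface CartItem { product: string; qty: number }
// Stand-in for the element carrying aria-live="polite".
interface LiveRegion { textContent: string | null }
// Stand-in for a text-to-speech wrapper.
interface Speaker { speak(text: string): void }

// Build one sentence that works both on screen and when spoken.
function cartSummary(items: CartItem[]): string {
  if (items.length === 0) return "Your cart is empty.";
  const parts = items.map((it) => `${it.qty} x ${it.product}`);
  return `Your order: ${parts.join(", ")}.`;
}

// Updating the live region lets screen readers announce the change;
// the optional speaker reads the same text aloud via TTS.
function announce(region: LiveRegion, speaker: Speaker | null, text: string): void {
  region.textContent = text;
  if (speaker) speaker.speak(text);
}

// In a browser the speaker could be wired up along these lines:
//   const speaker = {
//     speak: (t: string) => speechSynthesis.speak(new SpeechSynthesisUtterance(t)),
//   };
```

Passing null as the speaker keeps the visual transcript working on devices without speech synthesis.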

When the browser lacks the Web Speech API or microphone access is denied, the menu silently falls back to classic tap-to-order, and every other feature still works. That keeps voice as a delightful upgrade, not a fragile dependency.
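
The fallback decision can be sketched as a small policy function. The error codes below (not-allowed, service-not-allowed, audio-capture, language-not-supported, no-speech, network) are values the SpeechRecognition error event can report; how they map to actions is an assumption, and initialMode and actionForError are hypothetical names.

```typescript
type Mode = "voice" | "tap";
type Action = "fallback-to-tap" | "ask-retry";

// Errors that will not go away by simply trying again (assumed grouping).
const PERMANENT_ERRORS = new Set<string>([
  "not-allowed",            // user or policy denied microphone access
  "service-not-allowed",    // browser refuses the recognition service
  "audio-capture",          // no usable microphone
  "language-not-supported", // requested locale is unavailable
]);

// Voice is only offered when the recognizer exists; otherwise tap-to-order.
function initialMode(hasRecognition: boolean): Mode {
  return hasRecognition ? "voice" : "tap";
}

// Permanent failures switch quietly to tap-to-order; transient ones
// (for example "no-speech" or "network") just prompt the guest to retry.
function actionForError(code: string): Action {
  return PERMANENT_ERRORS.has(code) ? "fallback-to-tap" : "ask-retry";
}
```

Keeping this policy separate from the UI means the tap-to-order path never depends on anything voice-related having loaded.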

FAQ

Is browser voice ordering expensive? The Web Speech API is free; Workers AI NLU costs roughly $0.01 per 1,000 requests, far below the average ticket margin.

Which plan ships it? Pro and Platinum include AI voice ordering; Starter stays on the classic QR menu.

Does it survive a loud cafe? Browser ASR accuracy drops above 65 dB ambient noise, so the disambiguation card and tap confirmation are always one finger away.



