A diner uploads a photo to Perplexity Pro Vision with the caption "find an Istanbul restaurant that serves a dish like this." The model now parses the image, extracts ingredient signatures, and matches the result against regional menu indexes. As of 2026, GPT-4 with vision, Claude 3 Opus and Gemini Ultra process image plus text in one pass, and restaurants whose images carry the right metadata get cited far more often.
The Three-Layer Image Stack
One signal is never enough for multi-modal models. When thMenu added three layers to every menu image, multi-modal AI citation rates climbed 62%. The layers work in concert: the structured layer tells machines what the image is, the semantic layer tells them why it matters, and the similarity layer tells them what else it resembles.
- Schema.org ImageObject: caption, contentUrl, description, and about fields fully populated.
- Semantic alt-text: not "dish photo" but "wood-fired aubergine with yoghurt and pomegranate — 380 kcal".
- Visual-similar metadata: cuisine taxonomy and visual class tags (pide, mezze, grilled).
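As a rough illustration of how the three layers could sit together on one menu image, the sketch below builds a Schema.org ImageObject, pairs it with semantic alt-text, and carries the taxonomy tags in the `keywords` property. The function names, the example URL, and the choice of `keywords` for the visual-similar tags are assumptions made for this example, not thMenu's actual implementation.

```python
import html
import json


def build_image_object(content_url, caption, description, dish_name, tags):
    """Return a Schema.org ImageObject as a JSON-LD-ready dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "caption": caption,
        "description": description,
        # `about` names the thing depicted; MenuItem is a Schema.org type.
        "about": {"@type": "MenuItem", "name": dish_name},
        # Assumption: cuisine taxonomy and visual class tags ride in `keywords`.
        "keywords": ", ".join(tags),
    }


def render_menu_image(image_object, alt_text):
    """Return an <img> tag followed by an embedded JSON-LD <script> block."""
    # Escape "</" so the JSON payload cannot close the script element early.
    payload = json.dumps(image_object, ensure_ascii=False).replace("</", "<\\/")
    img = '<img src="{}" alt="{}">'.format(
        html.escape(image_object["contentUrl"]), html.escape(alt_text)
    )
    return img + '\n<script type="application/ld+json">' + payload + "</script>"


if __name__ == "__main__":
    image = build_image_object(
        "https://example.com/menu/aubergine.avif",  # hypothetical URL
        "Wood-fired aubergine",
        "Wood-fired aubergine served with yoghurt and pomegranate.",
        "Wood-fired aubergine",
        ["Turkish", "mezze", "grilled"],
    )
    print(render_menu_image(
        image, "wood-fired aubergine with yoghurt and pomegranate, 380 kcal"
    ))
```

Keeping the alt-text on the `<img>` and the structured fields in JSON-LD means each layer can be edited independently without touching the other.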
A Real Perplexity Pro Vision Trace
In one logged trace, a user uploaded a hummus plate and asked "where in Istanbul can I get a lighter version of this?" Perplexity cited four restaurants in its answer — three of them ran on thMenu and had the full three-layer stack. The fourth had only generic alt-text and ranked last with no image card preview.
Six months ago the same query would have leaned on text SEO alone; the photo would have been treated as decorative and the venues would never surface in the visual answer. The lesson is clear: image discoverability is becoming a first-class ranking surface, not an afterthought.
How to Roll It Out
Inside the thMenu admin panel, every product has an "AI Image Description" field. A draft description is auto-filled by Cloudflare Workers AI (Llama 3.1 8B plus a vision model), after which you manually verify the cuisine and diet tags. The system embeds Schema.org ImageObject markup on every menu page and serves AVIF and WebP variants through the Worker for performance.
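Serving AVIF and WebP variants usually comes down to reading the request's `Accept` header and choosing the best format the client advertises. The sketch below shows that selection logic in isolation. The function name, the JPEG fallback, and the decision to ignore `q` weights are assumptions for illustration; the real worker logic is not described in this article.

```python
def pick_image_variant(accept_header, preferred=("avif", "webp")):
    """Pick the first preferred format the client accepts, else fall back to JPEG.

    Simplification: q-values are ignored and wildcards such as */* are
    treated as "no explicit support", which triggers the fallback.
    """
    accepted = {
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    }
    for fmt in preferred:
        if "image/" + fmt in accepted:
            return fmt
    return "jpeg"  # assumed fallback format, not stated in the article


if __name__ == "__main__":
    header = "image/avif,image/webp,image/apng,*/*;q=0.8"
    print(pick_image_variant(header))
```

Listing AVIF first reflects the common preference for the smaller format when a client supports both; swapping the order in `preferred` changes that policy without touching the rest of the function.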
Visual-similar metadata is driven by a regional cuisine taxonomy. Tags such as Turkish, Mediterranean, Anatolian and Ottoman place a product's embedding closer to similar dishes in vector space, which raises the chance of being cited on lookalike queries.
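A toy model can make the "closer in vector space" idea concrete. The sketch below represents each dish as a multi-hot vector over a small tag vocabulary and compares it with a query using cosine similarity. This is a deliberate simplification: production systems use learned embeddings rather than raw tag vectors, and the vocabulary and dishes here are invented for the example.

```python
import math


def tag_vector(tags, vocabulary):
    """Encode a set of tags as a multi-hot vector over a fixed vocabulary."""
    return [1.0 if term in tags else 0.0 for term in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical vocabulary and dishes, for illustration only.
    vocabulary = ["turkish", "mediterranean", "anatolian", "ottoman", "mezze", "grilled"]
    query = tag_vector({"mediterranean", "mezze"}, vocabulary)
    tagged_dish = tag_vector({"turkish", "mediterranean", "mezze"}, vocabulary)
    generic_dish = tag_vector({"grilled"}, vocabulary)

    print("tagged:", round(cosine(query, tagged_dish), 3))
    print("generic:", round(cosine(query, generic_dish), 3))
```

The dish that shares taxonomy tags with the query scores higher than the one that shares none, which is the effect the taxonomy is meant to produce at scale.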
FAQ
What is multi-modal AI search? A new generation of search that processes image and text queries together — Perplexity Pro Vision and Gemini Ultra are the canonical examples.
Do I need to hand-write alt-text for every image? No — thMenu's AI generates a draft, but you should still verify cuisine and dietary tags manually for accuracy.
How does visual-similar metadata work? Regional cuisine taxonomy tags pull product embeddings close to neighbours, raising the chance of citation on lookalike image queries.