industry · 2027-11-15 · 7 min read

Multi-Modal AI Search: How Image + Text Combinations Lift Restaurant Visibility

ChatGPT-4 Vision, Claude 3 Opus and Gemini Ultra handle image-plus-text queries in a single pass; thMenu saw multi-modal AI citations rise 62% after adding three metadata layers to its menu images.

thMenu Team

thmenu.com

A diner uploads a photo to Perplexity Pro Vision with the caption "find an Istanbul restaurant that serves a dish like this." The model now parses the image, extracts ingredient signatures, and matches the result against regional menu indexes. As of 2026, ChatGPT-4 Vision, Claude 3 Opus and Gemini Ultra process image plus text in one pass — and restaurants whose images carry the right metadata get cited far more often.

The Three-Layer Image Stack

One signal is never enough for multi-modal models. When thMenu added three layers to every menu image, multi-modal AI citation rates climbed 62%. The layers work in concert: the structured layer tells machines what the image is, the semantic layer tells them why it matters, the similarity layer tells them what else it resembles.

  • Schema.org ImageObject: caption, contentUrl, description, and about fields fully populated.
  • Semantic alt-text: not "dish photo" but "wood-fired aubergine with yoghurt and pomegranate — 380 kcal".
  • Visual-similar metadata: cuisine taxonomy and visual class tags (pide, mezze, grilled).
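The three layers can be combined into one JSON-LD record per image. The snippet below is a minimal sketch, not thMenu's actual implementation: the function name `build_image_object`, the example URL, and the choice to carry the taxonomy tags in `keywords` are assumptions made for illustration. The properties themselves (`caption`, `contentUrl`, `description`, `about`, `keywords`) and the `MenuItem` type are standard Schema.org vocabulary.

```python
import json


def build_image_object(name, content_url, alt_text, description,
                       cuisine_tags, visual_tags):
    """Return a Schema.org ImageObject as a JSON-LD-ready dict.

    Hypothetical helper: the field mapping is an assumption, not a
    documented thMenu format.
    """
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        # Layer 1: structured fields that say what the image is.
        "contentUrl": content_url,
        "about": {"@type": "MenuItem", "name": name},
        # Layer 2: semantic text that says why it matters.
        "caption": alt_text,
        "description": description,
        # Layer 3: cuisine taxonomy and visual class tags.
        "keywords": ", ".join(cuisine_tags + visual_tags),
    }


record = build_image_object(
    name="Wood-fired aubergine",
    content_url="https://example.com/img/aubergine.avif",  # placeholder URL
    alt_text="Wood-fired aubergine with yoghurt and pomegranate, 380 kcal",
    description="Charred aubergine served on strained yoghurt with pomegranate.",
    cuisine_tags=["Turkish", "Mediterranean"],
    visual_tags=["mezze", "grilled"],
)

# Paste the output inside <script type="application/ld+json"> on the menu page.
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Reusing the same alt-text string for `caption` keeps the page markup and the structured data in agreement, which is one reasonable way to avoid the two drifting apart.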

A Real Perplexity Pro Vision Trace

In one logged trace, a user uploaded a hummus plate and asked "where in Istanbul can I get a lighter version of this?" Perplexity cited four restaurants in its answer — three of them ran on thMenu and had the full three-layer stack. The fourth had only generic alt-text and ranked last with no image card preview.

Six months ago the same query would have leaned on text SEO alone: the photo would have been treated as decorative, and the venues would never have surfaced in the visual answer. The lesson is clear: image discoverability is becoming a first-class ranking surface, not an afterthought.

How to Roll It Out

Inside the thMenu admin panel every product has an "AI Image Description" field. The auto-fill seed runs on Cloudflare Workers AI (LLaMA 3.1 8B + vision), then you manually verify cuisine and diet tags. The system embeds Schema.org ImageObject markup on every menu page and serves AVIF + WebP variants through the worker for performance.
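The post does not say how the worker chooses between the AVIF and WebP variants. A common approach is to inspect the request's `Accept` header, and the sketch below assumes that approach. The function name `pick_image_format` and the JPEG fallback are hypothetical, and the logic ignores `q` weights for brevity.

```python
def pick_image_format(accept_header: str) -> str:
    """Pick the best image variant the client says it can decode.

    Assumed preference order: AVIF, then WebP, then a JPEG fallback.
    """
    # Drop parameters such as ";q=0.8" and normalise each media type.
    accepted = {
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    }
    for mime, extension in (("image/avif", "avif"), ("image/webp", "webp")):
        if mime in accepted:
            return extension
    return "jpeg"


print(pick_image_format("image/avif,image/webp,image/*,*/*;q=0.8"))
print(pick_image_format("image/webp,*/*;q=0.8"))
print(pick_image_format("*/*"))
```

When responses are cached, the variant served should depend on the `Accept` header (for example via a `Vary: Accept` response header) so that one client's AVIF is not handed to a client that cannot decode it.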

Visual-similar metadata is driven by a regional cuisine taxonomy. Tags such as Turkish, Mediterranean, Anatolian, and Ottoman place a product's embedding close to similar dishes in vector space, which raises the chance of being cited on lookalike queries.
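To see why shared tags pull items together, consider a toy model in which each dish is a one-hot vector over the taxonomy and closeness is measured by cosine similarity. This is an illustration only: real multi-modal systems use learned embeddings, and the `TAXONOMY` list, the helper names, and the example dishes are assumptions.

```python
import math

# Assumed taxonomy, built from the tags named in this post.
TAXONOMY = ["turkish", "mediterranean", "anatolian", "ottoman",
            "pide", "mezze", "grilled"]


def tag_vector(tags):
    """Encode a dish's tags as a one-hot vector over TAXONOMY."""
    present = {tag.lower() for tag in tags}
    return [1.0 if tag in present else 0.0 for tag in TAXONOMY]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


hummus = tag_vector(["Turkish", "Mediterranean", "mezze"])
aubergine = tag_vector(["Turkish", "Anatolian", "mezze", "grilled"])
untagged = tag_vector([])

print(round(cosine(hummus, aubergine), 3))  # two shared tags -> 0.577
print(cosine(hummus, untagged))             # no tags -> 0.0
```

The untagged dish scores zero against everything, which mirrors the point of the paragraph above: an image with no taxonomy tags has nothing to match on in a lookalike query.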

FAQ

What is multi-modal AI search? A new generation of search that processes image and text queries together — Perplexity Pro Vision and Gemini Ultra are the canonical examples.

Do I need to hand-write alt-text for every image? No — thMenu's AI generates a draft, but you should still verify cuisine and dietary tags manually for accuracy.

How does visual-similar metadata work? Regional cuisine taxonomy tags pull product embeddings close to neighbours, raising the chance of citation on lookalike image queries.
