guides | 2027-11-19 | 7 min read

AI Search Lab: A Weekly 1-Hour LLM Query Test Workflow

18 standard queries, 4 LLMs, one hour a week. thMenu has run this for 11 months and detects citations 14% more accurately than automated tools.

thMenu Team

thmenu.com

You don't need an expensive SaaS to measure your brand's visibility in AI search. For 11 months, thMenu has run a 1-hour "AI Search Lab" every Thursday: 18 standard queries, 4 LLMs, logged manually to a Google Sheet. The payoff: citation detection that is 14% more accurate than automated tracking tools, with almost no false positives.

The 18-Query Standard Set

The same 18 queries run every week, so only the answers change. Without a fixed set you can't analyze trends. The distribution: 3 brand-specific ("what is thMenu", "thMenu pricing", "thMenu vs Square"), 6 comparison (such as "best QR menu 2027" and "QR menu for small restaurants"), 6 informational (such as "how to set up a QR menu" and "waiter call system") and 3 voice-style (such as "hey siri what is the best qr menu app").
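As a rough sketch, the fixed set can live in a small Python structure so the bucket sizes are checked before each weekly run. The bucket names and the `validate_query_set` helper are illustrative assumptions, not part of thMenu's actual tooling, and only the queries quoted above are filled in.

```python
from collections import Counter

# Target bucket sizes from the distribution above: 3 + 6 + 6 + 3 = 18.
EXPECTED_BUCKETS = {"brand": 3, "comparison": 6, "informational": 6, "voice": 3}


def validate_query_set(queries):
    """Return a list of problems; an empty list means the set matches the target."""
    counts = Counter(bucket for bucket, _ in queries)
    problems = []
    for bucket, expected in EXPECTED_BUCKETS.items():
        actual = counts.get(bucket, 0)
        if actual != expected:
            problems.append(f"{bucket}: expected {expected}, found {actual}")
    for bucket in counts:
        if bucket not in EXPECTED_BUCKETS:
            problems.append(f"unknown bucket: {bucket}")
    texts = [text for _, text in queries]
    if len(set(texts)) != len(texts):
        problems.append("duplicate query text")
    return problems


# Only the queries quoted in the article; the other ten are not listed there.
KNOWN_QUERIES = [
    ("brand", "what is thMenu"),
    ("brand", "thMenu pricing"),
    ("brand", "thMenu vs Square"),
    ("comparison", "best QR menu 2027"),
    ("comparison", "QR menu for small restaurants"),
    ("informational", "how to set up a QR menu"),
    ("informational", "waiter call system"),
    ("voice", "hey siri what is the best qr menu app"),
]

# Reports the buckets that still need queries before the set is complete.
print(validate_query_set(KNOWN_QUERIES))
```

Running the check before each session is one way to make sure a query was not dropped or duplicated by accident, which would quietly break week-over-week comparisons.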

Voice-style queries matter more than they used to: after Apple Intelligence and Gemini's natural-language rollout in 2026, conversational search hit a 38% share. Skip this bucket and you blind yourself to more than a third of inbound intent.

4-LLM Comparison

Every query runs on four engines: ChatGPT (GPT-5), Claude (Opus 4.7), Gemini (2.5 Pro), and Perplexity. That's 18 × 4 = 72 tests per week. thMenu's running average is citations in 32 of those 72 tests, a 44% visibility rate. This metric is a leading indicator for your future referral traffic from AI surfaces.
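The arithmetic behind the visibility rate is simple enough to sketch in a few lines. The `visibility_rate` function name and its defaults are assumptions for illustration; the 18, 4 and 32 come from the figures above.

```python
def visibility_rate(cited_tests, queries=18, engines=4):
    """Share of the weekly tests (queries x engines) in which the brand was cited."""
    total = queries * engines
    if not 0 <= cited_tests <= total:
        raise ValueError(f"cited_tests must be between 0 and {total}")
    return cited_tests / total


rate = visibility_rate(32)
print(f"32 of {18 * 4} tests cited: {rate:.0%}")  # 32 of 72 tests cited: 44%
```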

For each test we record four columns in the Sheet: (1) citation present or not, (2) which page was cited, (3) which competitor was also cited, (4) tone (positive / neutral / negative). Manual reading catches irony, hedging, and ordering nuance that scrapers miss every time.
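One way to keep the log consistent is to model each test as a typed record before it goes into the Sheet. This is a minimal sketch, not thMenu's actual schema: the `LabRow` class, the identifying `week`, `engine` and `query` fields, and the rule that uncited tests carry no tone are all assumptions. The last four fields mirror the four columns described above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional

TONES = {"positive", "neutral", "negative"}


@dataclass
class LabRow:
    week: str                   # assumed identifier, e.g. the date of the Thursday run
    engine: str                 # assumed identifier
    query: str                  # assumed identifier
    cited: bool                 # column 1: citation present or not
    cited_page: Optional[str]   # column 2: which page was cited
    competitors: str            # column 3: competitor(s) also cited
    tone: Optional[str]         # column 4: positive / neutral / negative

    def __post_init__(self):
        # Assumption: a cited test must carry a tone, an uncited one must not.
        if self.cited and self.tone not in TONES:
            raise ValueError(f"tone must be one of {sorted(TONES)}")
        if not self.cited and self.tone is not None:
            raise ValueError("uncited tests should not carry a tone")


def rows_to_csv(rows):
    """Serialize rows to CSV text that can be pasted or imported into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(LabRow)], lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder values for illustration only, not real results.
example = LabRow("2027-11-18", "Perplexity", "thMenu pricing",
                 True, "https://example.com/pricing", "Square", "positive")
print(rows_to_csv([example]))
```

Validating the tone at entry time keeps free-text variants such as "pos" or "mixed" out of the log, which matters once the column feeds a chart.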

Manual + Automated Hybrid Edge

SaaS trackers (Profound, Goodie, Otterly) are fast but blind: they flag a citation as "we appeared" without knowing whether the mention was positive, negative, or even pointed at the right URL. Our 47-week comparison log shows automated tools generated 14 false positives over those 11 months; our manual routine generated 1.

Ideal stack: an automated tool mid-week for volume, plus 1-hour manual verification on Thursday for quality. Cost: automated subscriptions run $99-$299/month; the manual routine is roughly 4 staff-hours/month, about $80. Combined, you reach 94% signal accuracy.

FAQ

Is one LLM enough? No — citation overlap across engines is only 31%. Visibility on one doesn't imply visibility on the others.
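The article does not say how the 31% overlap is calculated. As one possible definition, the sketch below averages the Jaccard overlap of the cited-query sets for every pair of engines. The `mean_pairwise_overlap` function, the engine labels and the sample data are all made up for illustration.

```python
from itertools import combinations


def mean_pairwise_overlap(cited_by_engine):
    """Average Jaccard overlap of cited-query sets across every pair of engines."""
    scores = []
    for a, b in combinations(sorted(cited_by_engine), 2):
        set_a, set_b = cited_by_engine[a], cited_by_engine[b]
        union = set_a | set_b
        if union:  # skip pairs where neither engine cited the brand at all
            scores.append(len(set_a & set_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: query numbers for which each engine cited the brand.
sample = {
    "engine_a": {1, 2, 3},
    "engine_b": {2, 3, 4},
    "engine_c": {5},
}
print(f"{mean_pairwise_overlap(sample):.0%}")
```

A low score under this definition means the engines cite the brand for largely different queries, which is the situation the answer above describes.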

Should I change the query set? Keep the core 18 stable for 12+ months; only add 2-3 fresh voice-style queries per quarter.

How do you report results? A weekly visibility-percentage chart plus a sentiment heatmap in the Sheet. Monthly summary fits on one page.
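The numbers behind both views can be derived from the weekly log with a small aggregation. This is a sketch under assumptions: rows are simplified to `(week, engine, cited, tone)` tuples, the function names are hypothetical, and the sample rows are invented. Charting is left to the Sheet itself.

```python
from collections import Counter, defaultdict


def weekly_visibility(rows):
    """Map each week to the percentage of its tests that contained a citation."""
    totals = defaultdict(int)
    cited = defaultdict(int)
    for week, _engine, was_cited, _tone in rows:
        totals[week] += 1
        cited[week] += int(was_cited)
    return {week: round(100 * cited[week] / totals[week], 1) for week in sorted(totals)}


def tone_counts(rows):
    """Count tones per engine for cited tests: the grid behind a sentiment heatmap."""
    grid = defaultdict(Counter)
    for _week, engine, was_cited, tone in rows:
        if was_cited:
            grid[engine][tone] += 1
    return {engine: dict(counter) for engine, counter in grid.items()}


# Invented rows for illustration only.
rows = [
    ("W1", "engine_a", True, "positive"),
    ("W1", "engine_b", False, None),
    ("W2", "engine_a", True, "neutral"),
    ("W2", "engine_b", True, "positive"),
]
print(weekly_visibility(rows))
print(tone_counts(rows))
```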
