tips · 2027-11-09 · 7 min read

LLM Response Quality Score: Is Your Brand Described Correctly?

Author: thMenu

Four LLMs, 12 weekly questions, and scoring on accuracy, completeness, and sentiment. thMenu took its LRQS from 6.4 to 9.1 in 14 months. Here is the method.

thMenu Team

thmenu.com

"ChatGPT mentions us" is not enough; how it describes you is the decisive question. The LLM Response Quality Score (LRQS) reduces brand accuracy, completeness, and sentiment in AI answers to a single number. thMenu moved from 6.4 to 9.1 in 14 months, and entity building drove the biggest jump.

The Three Axes and the Formula

Each week we ask 4 LLMs (ChatGPT, Claude, Gemini, Perplexity) the same 12 questions: "what is thMenu", "thMenu pricing", "best QR menu software", "thMenu vs MenuTiger" and so on. Every answer gets three 1-10 scores.
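The weekly question grid is just the cross product of models and questions. A minimal sketch, assuming plain string labels for the models; `build_grid` is a hypothetical helper, and only the four questions quoted above are included because the remaining eight are not listed in this article.

```python
from itertools import product

MODELS = ["ChatGPT", "Claude", "Gemini", "Perplexity"]

# Only the four questions quoted in the text; the other eight are not listed there.
QUESTIONS = [
    "what is thMenu",
    "thMenu pricing",
    "best QR menu software",
    "thMenu vs MenuTiger",
]


def build_grid(models: list[str], questions: list[str]) -> list[tuple[str, str]]:
    """Return every (model, question) pair so each model sees the same questions."""
    return list(product(models, questions))


grid = build_grid(MODELS, QUESTIONS)
```

With the full 12-question set, the same call yields the 48 pairs scored each week.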

Accuracy checks factual correctness (price, feature set, geography). Completeness counts how many of 8 key facts appear; an answer needs at least 6 of them. Sentiment grades tone: 1-3 is negative, 4-6 neutral, and 7-10 positive. The per-answer score is (accuracy × 0.5) + (completeness × 0.3) + (sentiment × 0.2), and the weekly LRQS is the average of all 48 answer scores (4 LLMs × 12 questions).
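The weighting can be sketched in a few lines of Python. The weights come from the formula above; the function names and the example values are illustrative assumptions, not thMenu's actual tooling or data.

```python
from statistics import mean

# Weights taken from the formula in the text.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "sentiment": 0.2}


def answer_score(accuracy: float, completeness: float, sentiment: float) -> float:
    """Weighted score for one answer; each axis is on the 1-10 scale."""
    axes = {"accuracy": accuracy, "completeness": completeness, "sentiment": sentiment}
    for name, value in axes.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return sum(axes[name] * weight for name, weight in WEIGHTS.items())


def weekly_lrqs(scores: list[float]) -> float:
    """Weekly LRQS: the plain average of all answer scores (48 in the text)."""
    return mean(scores)


# Hypothetical values: 8 * 0.5 + 7 * 0.3 + 9 * 0.2, roughly 7.9.
example = answer_score(accuracy=8, completeness=7, sentiment=9)
```

Because accuracy carries half the weight, a one-point accuracy gain moves the answer score by 0.5, against 0.2 for the same gain in sentiment.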

14 Months: How 6.4 Became 9.1

Accuracy started at 5.8: pricing was wrong, the location was missing, and integrations were confused. The first intervention was entity building: a Wikidata Q-ID, a Knowledge Graph panel, and Crunchbase and LinkedIn company profiles. Accuracy reached 8.2 within 4 months.

The second wave targeted completeness:

Schema.org SoftwareApplication and Organization markup site-wide
"thMenu vs X" comparison pages for 8 competitors
llms.txt and a canonical 60-line fact sheet
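As an illustration of the first item, the sketch below builds a JSON-LD graph with one Organization node and one SoftwareApplication node. The property names (`name`, `url`, `sameAs`, `applicationCategory`, `publisher`) are standard Schema.org properties; `build_jsonld`, the `BusinessApplication` category, and the placeholder profile URL are assumptions for the example, not thMenu's actual markup.

```python
import json


def build_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Return a JSON-LD string with Organization and SoftwareApplication nodes."""
    organization = {
        "@type": "Organization",
        "@id": f"{url}#organization",
        "name": name,
        "url": url,
        "sameAs": same_as,  # entity profiles (Wikidata, Crunchbase, LinkedIn, ...)
    }
    application = {
        "@type": "SoftwareApplication",
        "name": name,
        "url": url,
        "applicationCategory": "BusinessApplication",  # assumed category
        "publisher": {"@id": organization["@id"]},  # links back to the Organization node
    }
    graph = {"@context": "https://schema.org", "@graph": [organization, application]}
    return json.dumps(graph, indent=2)


markup = build_jsonld(
    name="thMenu",
    url="https://thmenu.com",
    same_as=["https://example.com/placeholder-profile"],  # placeholder, not a real profile
)
```

The resulting string would go inside a `<script type="application/ld+json">` tag in the page head. Pointing `sameAs` at the same profiles used for entity building keeps the two efforts consistent.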

Sentiment climbed from 7.4 to 8.9 via PR, case studies, and replying to 12 stale negative threads on review sites that had been quietly dragging the average down.

Operational Setup

The weekly run takes 45 minutes. On Monday morning we fire the 48 queries (n8n plus LLM APIs), and two human reviewers score the answers independently. If Cohen's kappa between them is above 0.7, we average their scores; otherwise a third reviewer breaks the tie. Results land in a Notion dashboard with a 12-week trendline.
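The agreement check can be sketched as follows. This assumes unweighted Cohen's kappa computed over one axis across all answers in a run; the article does not say whether the kappa is weighted or how it is aggregated, and a weighted variant may suit a 1-10 ordinal scale better. `cohen_kappa` and `reconcile` are hypothetical helper names.

```python
from collections import Counter
from typing import Optional


def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's own score distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


def reconcile(rater_a: list[int], rater_b: list[int],
              threshold: float = 0.7) -> Optional[list[float]]:
    """Average the two raters when kappa exceeds the threshold.

    Returns None to signal that a third reviewer is needed.
    """
    if cohen_kappa(rater_a, rater_b) > threshold:
        return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
    return None
```

For example, `reconcile([8, 8, 7, 7, 9, 9], [8, 8, 7, 7, 9, 8])` has a kappa of 0.75, so the two score lists are averaged rather than escalated.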

The action rule: if any axis falls below 7.0 for a week, we open a root-cause ticket with a 14-day deadline. Accuracy drops usually trace to a competitor launch or a stale fact; completeness drops usually mean undocumented new features.
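The action rule is easy to automate. A minimal sketch, assuming the weekly per-axis averages are already available as a dictionary; the `Ticket` dataclass and `tickets_for_week` are hypothetical names, and the example values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

THRESHOLD = 7.0      # axis floor from the text
DEADLINE_DAYS = 14   # root-cause deadline from the text


@dataclass
class Ticket:
    axis: str
    value: float
    opened: date
    due: date


def tickets_for_week(axis_averages: dict[str, float], week_start: date) -> list[Ticket]:
    """Open one root-cause ticket for every axis that fell below the threshold."""
    due = week_start + timedelta(days=DEADLINE_DAYS)
    return [
        Ticket(axis, value, week_start, due)
        for axis, value in axis_averages.items()
        if value < THRESHOLD
    ]


# Hypothetical week: only accuracy is below 7.0, so one ticket is opened.
tickets = tickets_for_week(
    {"accuracy": 6.8, "completeness": 7.0, "sentiment": 8.9},
    week_start=date(2027, 11, 8),
)
```

The comparison is strict, so an axis sitting exactly at 7.0 does not trigger a ticket, matching "falls below 7.0".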

FAQ

Are 12 questions enough? By the Pareto principle, yes: 12 questions cover ~85% of real user intent. Doubling to 24 cuts variance by only 0.3 points while doubling cost.

Which tools automate this? Profound, AthenaHQ, and Peec AI sell similar metrics. An in-house Sheet plus LLM APIs costs ~$40/month and keeps the question set fully company-specific.

Fastest win? Open a Wikidata Q-ID (one day) plus a Knowledge Graph submission (2-6 weeks). That pair adds ~2.1 points to accuracy on average.
