How to Evaluate RAG Retrieval: Build the Eval First

Four Rebuilds, Zero Measurements

I deployed a chatbot for a dental clinic in Lithuania. Retrieval over the knowledge base used PostgreSQL full-text search, the standard approach. It worked in demos. Then the first real user typed a question in Lithuanian.

Zero results.

The FTS index used an English stemmer, and Lithuanian stems differently: a Lithuanian stemmer reduces "paslaugos" to "paslaug", while the English stemmer reduces it to "paslaugo". Different stems, no match. I knew about this when I shipped it. I deferred the fix because the agent could translate queries as a workaround. Five days later the workaround had stopped working.

That was rebuild one. Rebuild two came the next day when I realized FTS only returned the first 400 characters of a matching page. Pricing information was buried at character 2500 and beyond — and forty percent of real queries were about pricing. FTS could find the page but not the answer. I added pgvector semantic search: chunked every page into 800-character paragraphs, embedded with OpenAI, built an HNSW index.
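The chunking step can be sketched roughly as follows. This is a minimal illustration, not the clinic's production code: the function name `chunk_page`, the blank-line paragraph boundary, and the greedy packing strategy are all assumptions. Only the 800-character limit comes from the text above.

```python
def chunk_page(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    Assumption: paragraphs are separated by a blank line. A single paragraph
    longer than max_chars is hard-split at the character limit.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Hard-split an oversized paragraph so no chunk exceeds the limit.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Because every chunk gets its own embedding, text at character 2500 of a page becomes as retrievable as text at character 0, which is the property the 400-character FTS excerpt lacked.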

Rebuild three: Reciprocal Rank Fusion. I ran the FTS and vector searches in parallel and fused their rankings, at an average of 51 milliseconds. Then I added a metadata leg with JSONB matching on service categories and priority boosts for FAQ content. The result was a three-way hybrid with configurable weights.
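For readers who have not seen Reciprocal Rank Fusion, here is a minimal sketch of the weighted form. Each leg contributes weight / (k + rank) for every document it returns, and the contributions are summed per document. The function name, the leg names, and the constant k = 60 (a commonly used default) are assumptions for illustration; the actual weights used in production are not shown here.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]],
    weights: Optional[dict[str, float]] = None,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking.

    rankings maps a leg name (e.g. "fts", "vector", "metadata") to that
    leg's document ids, best first. Legs without an explicit weight get 1.0.
    """
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for leg, ranked_ids in rankings.items():
        leg_weight = weights.get(leg, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += leg_weight / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document ids, purely for illustration.
fused = reciprocal_rank_fusion(
    {"fts": ["a", "b", "c"], "vector": ["b", "a", "d"], "metadata": ["b"]}
)
```

RRF only needs ranks, not raw scores, so it can combine an FTS rank score and a cosine similarity without normalizing either one.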

Rebuild four: LLM contextual retrieval. Chunks that referenced "our services" without naming the specific service needed disambiguation. Following Anthropic's contextual retrieval approach, I added LLM-generated snippets that situated each chunk within its source page.

I also learned that Gemini's embedding API projects text differently depending on whether you label it as a query or a document — RETRIEVAL_QUERY versus RETRIEVAL_DOCUMENT. Using the wrong task type silently degrades cosine similarity. No error, no warning. Just worse results you cannot explain.

Four architectures across five months. Each one fixed a real problem. None had a baseline to prove it made retrieval better overall.

What 42 Production Queries Told Me

Five months in, I built the eval. Forty-two golden queries from real patient conversations — not synthetic test cases. Real questions people typed between November 2025 and March 2026.

Evaluate RAG retrieval with a golden dataset of real user queries categorized by intent. I used 42 production queries across pricing (26%), services (29%), FAQ (38%), and doctor-specific (7%) categories, measured with Precision@3, Recall@3, MRR@3, HitRate@3, and ContentContains@3. A regression gate in CI asserts that each metric stays within ±0.01 of the baseline, so quality drops are caught before deployment. These figures are as of Q2 2026.
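The four standard metrics are short enough to write by hand. The sketch below uses their usual textbook definitions with hypothetical function names; it is not the eval harness itself. ContentContains@3 is left out because its exact definition is specific to this project.

```python
from statistics import mean


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """1 if any relevant document appears in the top k, else 0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 3) -> dict[str, float]:
    """Average each metric over (retrieved_ids, relevant_ids) pairs, one per query."""
    return {
        f"precision@{k}": mean(precision_at_k(r, rel, k) for r, rel in runs),
        f"recall@{k}": mean(recall_at_k(r, rel, k) for r, rel in runs),
        f"mrr@{k}": mean(reciprocal_rank_at_k(r, rel, k) for r, rel in runs),
        f"hit_rate@{k}": mean(hit_at_k(r, rel, k) for r, rel in runs),
    }
```

MRR@3 and HitRate@3 answer different questions: HitRate says whether the right document showed up at all, MRR says how close to the top it landed.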

The category split matters because retrieval fails differently by type. Doctor queries had a 33% hit rate. Pricing queries hit 100% — the pgvector rebuild had actually worked, though I did not know that until I measured it five months later. Overall baseline: MRR@3 of 0.5992 and HitRate@3 of 0.7619.
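A regression gate over those baseline numbers can be as small as the sketch below. The two baseline values are the ones reported above; the function name `check_regression` and the metric keys are assumptions. The sketch reads the tolerance as a symmetric band, so an unexpected jump also fails and prompts a baseline refresh. A one-sided check that only fails on drops would be an equally reasonable reading.

```python
BASELINE = {"mrr@3": 0.5992, "hit_rate@3": 0.7619}
TOLERANCE = 0.01


def check_regression(
    current: dict[str, float],
    baseline: dict[str, float] = BASELINE,
    tolerance: float = TOLERANCE,
) -> list[str]:
    """Return one message per metric that moved outside baseline +/- tolerance."""
    failures = []
    for name, expected in baseline.items():
        actual = current[name]
        if abs(actual - expected) > tolerance:
            failures.append(
                f"{name}: expected {expected:.4f} +/- {tolerance}, got {actual:.4f}"
            )
    return failures


def assert_no_regression(current: dict[str, float]) -> None:
    """Raise AssertionError, which fails a CI test run, if any metric drifted."""
    failures = check_regression(current)
    assert not failures, "retrieval metrics drifted: " + "; ".join(failures)
```

Calling `assert_no_regression` from a test that CI already runs is enough to block a deploy when a retrieval change moves a metric.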

The first optimization after building the eval was a Cohere reranker. Every RAG tutorial recommends adding one. On the same 42 queries it improved Precision@3 by 0.024. I left it off because the gain did not justify the dependency and recurring cost. Without the eval, I would have shipped it and assumed it helped.

Build the Eval Before Iteration Two

Build your RAG eval as soon as the first retrieval attempt works at all, before the second optimization. Your first retrieval is the baseline, not the product. Twenty real queries with intent categories are enough to start. The eval tells you whether iteration two is better or just different. Four blind rebuilds cost me five months. Building the eval took one day.

Extract queries from your logs or support tickets — whatever your users actually ask. Tag each with an intent category and the document that should rank first. Run your retrieval, score it, and you have a baseline. Now every change you make is measurable.
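Those steps fit in a few dozen lines. In the sketch below, the `GoldenQuery` fields, the `retrieve` callable, and every query and document id are hypothetical placeholders, not entries from the real dataset. Swap in your own logged queries and your own retrieval function.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenQuery:
    text: str             # the question exactly as the user typed it
    category: str         # intent tag, e.g. "pricing" or "doctor"
    expected_doc_id: str  # the document that should rank first


def hit_rate_by_category(
    queries: list[GoldenQuery],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> dict[str, float]:
    """Share of queries per category whose expected document is in the top k."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for query in queries:
        totals[query.category] += 1
        if query.expected_doc_id in retrieve(query.text)[:k]:
            hits[query.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Placeholder golden set and a stub retriever standing in for the real system.
golden = [
    GoldenQuery("how much is a cleaning", "pricing", "doc-prices"),
    GoldenQuery("price of whitening", "pricing", "doc-prices"),
    GoldenQuery("which doctor does implants", "doctor", "doc-team"),
    GoldenQuery("is dr. X available on fridays", "doctor", "doc-team"),
]
stub_results = {
    "how much is a cleaning": ["doc-prices", "doc-faq"],
    "price of whitening": ["doc-faq", "doc-prices"],
    "which doctor does implants": ["doc-services", "doc-faq", "doc-prices"],
    "is dr. X available on fridays": ["doc-team"],
}
baseline = hit_rate_by_category(golden, lambda text: stub_results[text])
```

Reporting the score per category rather than as one number is what exposes a weak category that a healthy overall average would hide.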

Seventy percent of RAG systems still have no systematic evaluation framework. I was in that seventy percent for five months. Some of my rebuilds helped. Some probably did not. I will never know which.

Your first retrieval system is not a product. It is a baseline.

If you are building retrieval and want to skip the months of blind iteration, get in touch.