How Do ML Engineers Evaluate RAG Pipelines Properly?

For ML engineers building RAG applications · Based on Hetzel Eval Maturity Phases Framework

// TL;DR

The Hetzel Eval Maturity Phases Framework gives ML engineers building RAG applications a structured path to evaluate both retrieval quality and answer generation. RAG pipelines have specific failure modes — wrong document retrieval, hallucination beyond retrieved context, incomplete synthesis, format errors — that map directly to the framework's scoring function approach. Treat your vector database as an external system dependency, embed retrieval state into traces for reproducible evals, and use the flywheel to continuously improve retrieval and generation based on real user queries from production.

Why are RAG pipelines especially hard to evaluate?

RAG pipelines combine retrieval and generation, creating a two-stage failure surface. Your retrieval can return the wrong documents, return the right documents in the wrong order, or miss critical context entirely. Your generation can then hallucinate beyond what was retrieved, fail to synthesize multiple sources, or produce outputs in the wrong format. Evaluating only the final answer cannot tell you whether a correct answer came from correct retrieval or from luck.

The Hetzel Eval Maturity Phases Framework addresses this directly through its principle of evaluating the full trace, not just the final output. For RAG, this means scoring retrieval and generation independently.

How do you apply Level 1 and Level 2 to a RAG pipeline?

At Level 1, select 10–20 representative user queries that cover your RAG system's domain. Have a subject matter expert review each output and provide a thumbs up/down with a written justification. Pay special attention to justifications that reveal retrieval failures versus generation failures: "Thumbs down — the answer is about the wrong product because the retriever pulled the wrong docs" versus "Thumbs down — the retrieved docs were correct but the model hallucinated a feature."

At Level 2, derive failure modes from these justifications. Common RAG failure modes include: wrong documents retrieved, relevant documents ranked too low, hallucination beyond retrieved context, incomplete synthesis of multiple sources, outdated information retrieved, and format/structure errors.

For each failure mode, build the appropriate scoring function. Retrieval precision and recall can be measured deterministically if you have ground-truth relevant documents for each query. Answer faithfulness to retrieved context is a strong LLM-as-judge candidate: use the justification language from your annotations to write the judge prompt, then validate that judge against human-labelled ground truth.
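A minimal sketch of a deterministic retrieval scorer, assuming each eval query comes with a set of ground-truth relevant document IDs. The function name `precision_recall_at_k` and the document IDs are illustrative and not part of the framework.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int
) -> Tuple[float, float]:
    """Score one query's retrieval against ground-truth relevant documents.

    retrieved_ids: document IDs in the order the retriever returned them.
    relevant_ids: IDs a human marked as relevant for this query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Drop duplicate IDs while keeping rank order, then keep the top k.
    top_k = list(dict.fromkeys(retrieved_ids))[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Precision is computed over the documents actually returned (at most k).
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical example: only one of the top three results is relevant.
precision, recall = precision_recall_at_k(
    retrieved_ids=["doc-7", "doc-2", "doc-9", "doc-4"],
    relevant_ids={"doc-2", "doc-4", "doc-5"},
    k=3,
)
```

Because the scorer is a pure function of the retrieved IDs and the labels, rerunning it on the same trace always gives the same score, which is what makes it suitable for regression-style evals.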

How do you handle the vector database as an external system dependency?

Your vector database is a context-gathering tool — it reads data and injects it into the LLM context without modifying external systems. This makes it lower risk than CRUD tools, but you still need reproducible retrieval state for your evals.

Embed the retrieved documents and their relevance scores directly into your captured trace payload. If your vector database supports timestamp queries, use them to replay the index state that existed when the original trace was captured — this is critical when your knowledge base is frequently updated. If timestamp queries aren't available, store the retrieval results as snapshots within the trace itself so your eval always references the same context the agent had at query time.

This ensures your evals reflect actual production conditions rather than the current index state, which may have changed.
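One way to store retrieval results as a snapshot inside the trace can be sketched as follows. The `RagTrace` and `RetrievedChunk` classes, their fields, and the sample values are hypothetical, and JSON is only one possible storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RetrievedChunk:
    doc_id: str
    text: str
    relevance_score: float


@dataclass
class RagTrace:
    trace_id: str
    query: str
    captured_at: str  # ISO 8601 timestamp of the original request
    retrieved: List[RetrievedChunk] = field(default_factory=list)
    answer: str = ""


def trace_to_json(trace: RagTrace) -> str:
    """Serialize the trace, including the retrieval snapshot."""
    return json.dumps(asdict(trace))


def trace_from_json(payload: str) -> RagTrace:
    """Rebuild a trace from its stored JSON payload."""
    data = json.loads(payload)
    chunks = [RetrievedChunk(**chunk) for chunk in data.pop("retrieved")]
    return RagTrace(retrieved=chunks, **data)


def replay_context(trace: RagTrace) -> str:
    """Rebuild the generation context from the snapshot, not the live index."""
    return "\n\n".join(chunk.text for chunk in trace.retrieved)


# Hypothetical captured trace.
trace = RagTrace(
    trace_id="trace-001",
    query="How do I rotate an API key?",
    captured_at="2024-01-01T00:00:00Z",
    retrieved=[
        RetrievedChunk("kb-12", "Keys are rotated from the settings page.", 0.91),
        RetrievedChunk("kb-40", "Old keys stay valid for 24 hours.", 0.77),
    ],
    answer="Rotate the key from the settings page.",
)
restored = trace_from_json(trace_to_json(trace))
```

An eval that calls `replay_context(restored)` sees the same chunks the generator saw at capture time, even if the index has been re-embedded or updated since.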

How does the flywheel work specifically for RAG improvement?

Capture production query traces including the full retrieval results and generated answers. Surface failures through user feedback, human review, or automated scoring. Pull failing examples into your offline eval dataset.

Now you have a powerful improvement loop: when you experiment with chunking strategies, embedding models, reranking, or prompt changes, you rerun your production-derived eval dataset and measure the impact on both retrieval and generation scores. Each experiment produces quantifiable results: "Switching to semantic chunking improved retrieval precision from 0.68 to 0.81 on 200 production queries and answer faithfulness from 3.2 to 4.1 on our LLM-as-judge scale."

This turns RAG optimization from guesswork into engineering.
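The experiment loop above can be sketched as a comparison of two retriever configurations on the same production-derived dataset. Everything here is an assumption for illustration: the `Retriever` interface, the function names, and the toy queries and document IDs, which are not real results.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interface: a retriever maps a query to ranked document IDs.
Retriever = Callable[[str], List[str]]


def mean_precision_at_k(dataset: List[dict], retriever: Retriever, k: int) -> float:
    """Average precision@k over every example in the eval dataset."""
    scores = []
    for example in dataset:
        top_k = retriever(example["query"])[:k]
        hits = sum(1 for doc_id in top_k if doc_id in example["relevant_ids"])
        scores.append(hits / len(top_k) if top_k else 0.0)
    return mean(scores)


def compare_retrievers(
    dataset: List[dict], baseline: Retriever, candidate: Retriever, k: int
) -> Dict[str, float]:
    """Run both configurations on the same dataset and report the change."""
    base = mean_precision_at_k(dataset, baseline, k)
    cand = mean_precision_at_k(dataset, candidate, k)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Toy dataset and canned retriever outputs standing in for real pipelines.
dataset = [
    {"query": "reset password", "relevant_ids": {"kb-1", "kb-2"}},
    {"query": "refund policy", "relevant_ids": {"kb-9"}},
]
BASELINE_RESULTS = {
    "reset password": ["kb-1", "kb-5"],
    "refund policy": ["kb-3", "kb-4"],
}
CANDIDATE_RESULTS = {
    "reset password": ["kb-1", "kb-2"],
    "refund policy": ["kb-9", "kb-4"],
}


def baseline_retriever(query: str) -> List[str]:
    return BASELINE_RESULTS[query]


def candidate_retriever(query: str) -> List[str]:
    return CANDIDATE_RESULTS[query]


report = compare_retrievers(dataset, baseline_retriever, candidate_retriever, k=2)
```

The same pattern extends to generation scores: swap the scoring function for a faithfulness scorer and keep the dataset fixed so the two runs stay comparable.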

What should an ML engineer building RAG do next?

Identify your current maturity level. If you're evaluating RAG outputs informally, start at Level 1 with a subject matter expert (SME) review this week. If you already have some annotations, extract failure modes and build your first scoring functions. Prioritize building separate retrieval and generation scorers, because evaluating only the final answer hides whether improvements come from better retrieval or better generation. Set up trace capture that includes full retrieval results so you can build a production-representative eval dataset within your first two weeks of production usage.

// FREQUENTLY ASKED QUESTIONS

Should I evaluate retrieval and generation separately in my RAG pipeline?

Yes. The Hetzel framework's principle of evaluating the full trace means scoring individual steps, not just the final output. For RAG, this means separate scoring functions for retrieval quality (precision, recall, relevance ranking) and generation quality (faithfulness to retrieved context, completeness, accuracy). A correct final answer from wrong retrieval is a hidden failure that will eventually surface. Separate scoring lets you diagnose whether improvements should target the retriever or the generator.
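As a rough sketch of how separate scores support diagnosis, the function below combines three per-trace verdicts into a single label. The function name, the boolean inputs, and the label strings are all hypothetical; in practice each verdict would come from your own retrieval scorer, faithfulness judge, and correctness check.

```python
def diagnose(retrieval_ok: bool, faithful: bool, answer_correct: bool) -> str:
    """Attribute a trace's outcome to the retriever or the generator."""
    if not retrieval_ok and answer_correct:
        # Right answer despite wrong documents: passes a final-answer-only eval.
        return "hidden_failure"
    if not retrieval_ok:
        return "retrieval_failure"
    if not faithful:
        # Documents were right, but the answer went beyond them.
        return "generation_unfaithful"
    if not answer_correct:
        # Faithful to the documents but still wrong or incomplete.
        return "generation_incomplete"
    return "pass"


label = diagnose(retrieval_ok=False, faithful=True, answer_correct=True)
```

Counting these labels across an eval run gives a quick view of whether the next experiment should target the retriever or the generator.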

How do I create ground truth for validating my RAG eval's LLM-as-judge?

Select a representative sample of RAG outputs (50–100 examples covering diverse query types). Have a domain expert label each output on your scoring dimensions — faithfulness to retrieved context, answer completeness, accuracy. Record written justifications for every judgment. This becomes your ground truth dataset. Run your LLM-as-judge against it and measure agreement. Because judge outputs are discrete (e.g., 1–5 scale), you can calculate exact alignment percentages and identify where the judge systematically disagrees with human experts.
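A small sketch of the agreement measurement, assuming human and judge scores are integers on the same 1 to 5 scale and are listed in the same example order. The function name `judge_agreement` and the sample scores are illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def judge_agreement(
    human: List[int], judge: List[int]
) -> Tuple[float, Counter, float]:
    """Compare LLM-as-judge scores with human labels on the same examples.

    Returns the exact-match rate, a count of each (human, judge) disagreement
    pair, and the mean signed difference (positive means the judge scores
    higher than the human on average).
    """
    if not human or len(human) != len(judge):
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    exact_match_rate = sum(h == j for h, j in pairs) / len(pairs)
    disagreements = Counter((h, j) for h, j in pairs if h != j)
    mean_bias = sum(j - h for h, j in pairs) / len(pairs)
    return exact_match_rate, disagreements, mean_bias


# Hypothetical faithfulness scores for six labelled examples.
human_scores = [5, 4, 2, 1, 3, 4]
judge_scores = [5, 4, 3, 1, 3, 5]
rate, disagreements, bias = judge_agreement(human_scores, judge_scores)
```

The disagreement counter shows which score pairs are confused most often, and a consistently positive or negative bias suggests the judge prompt is systematically too lenient or too strict.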

What's the most common RAG eval mistake according to this framework?

The most common mistake is using only synthetic question-answer pairs instead of real production queries. The Hetzel framework's principle is that evals should approximate rerunning production. Synthetic examples miss the actual query patterns, edge cases, and phrasings your users employ. If you must, start with synthetic data at Level 1, but prioritize transitioning to captured production traces as quickly as possible. Real user queries reveal failure modes that no amount of synthetic data anticipates.