Evaluating GenAI Accuracy with Retrieval Metrics

Financial advisers at Morgan Stanley and mission engineers at NASA trust their chatbots because each answer is scored for recall, precision and groundedness before a user ever sees it. Microsoft’s new RAG guide and Pinecone’s evaluation toolkit show how to bake those checks into every release pipeline.

Carter Reynolds, 4 July 2025

The Retrieval-to-Answer Scorecard

Why retrieval metrics matter more than model mystique

Morgan Stanley’s AskResearchGPT parses 100 000 PDF notes, yet it passes compliance only because every sentence is footnoted back to its source; engineers gate the LLM behind a retriever that logs recall and precision on every nightly build (morganstanley.com, marketsmedia.com). NASA’s open-sourced VECTOR assistant does the same for flight-software manuals, tracking recall@5 and groundedness so that mission rules remain intact (github.com).

The core scorecard

| Metric | What it tells you | Healthy target* |
|---|---|---|
| Recall@k | Gold passages found in top k | ≥ 0.90 (k = 20) |
| Precision@k | Share of top k that are relevant | ≥ 0.80 (k = 20) |
| nDCG | Rank-aware usefulness | ≥ 0.85 |
| Groundedness | % of answer tokens that align to retrieved text | ≥ 0.95 |
| Context use | Share of supplied context the LLM actually cites | 0.70 – 0.85 |

Microsoft’s Azure AI Foundry groups groundedness, completeness and relevance under “response quality”, while Pinecone labels recall and precision as “retrieval quality” (learn.microsoft.com, pinecone.io, docs.pinecone.io).

*Targets taken from Microsoft RAG evaluator defaults and Pinecone field reports.
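The three retrieval rows of the scorecard can be computed in a few lines. The sketch below is a minimal illustration that assumes binary relevance (a passage is either gold or not) and uses made-up passage ids; a harness such as BEIR would normally do this for you, and the function names here are hypothetical, not part of any library.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold passages that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by gold passages (divides by k)."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & gold) / k


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, passage_id in enumerate(ranked[:k])
        if passage_id in gold
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0


# Toy example with invented ids: two of three gold passages are retrieved.
ranked = ["p7", "p2", "p9", "p4", "p1"]
gold = {"p2", "p4", "p8"}

print("recall@5   ", round(recall_at_k(ranked, gold, 5), 3))
print("precision@5", round(precision_at_k(ranked, gold, 5), 3))
print("nDCG@5     ", round(ndcg_at_k(ranked, gold, 5), 3))
```

Recall and precision count the same hits but divide by different totals, which is why the scorecard tracks both; nDCG additionally rewards placing those hits near the top of the list.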

A reference evaluation pipeline

1. Label: Pick 200–500 real user queries and tag their gold passages, drawing on BEIR or ticket history (pinecone.io).
2. A/B run: Query the baseline retriever (BM25 + vector) and the candidate (new embeddings).
3. Score: Compute recall@k, precision@k and nDCG with the BEIR harness.
4. Rerank swap-test: Drop in Cohere-r2 or bge-reranker-base, then re-score.
5. Response check: Feed the top passages plus the prompt to GPT-4o with a JSON rubric that grades groundedness and completeness (learn.microsoft.com).
6. Drift watch: Ship metrics to Langfuse or Prometheus; alert if recall sinks below the SLA.
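The A/B run, scoring and threshold check can be sketched as a small release gate. This is a rough illustration under stated assumptions: `score_retriever`, `release_gate`, `TARGETS` and the toy runs are hypothetical names and data, the retrievers are stand-in lookups rather than real BM25 or vector searches, and the thresholds are simply copied from the scorecard table.

```python
from statistics import mean
from typing import Callable, Dict, List, Set

# Thresholds copied from the scorecard table (stated there for k = 20).
TARGETS = {"recall": 0.90, "precision": 0.80}

Retriever = Callable[[str], List[str]]  # query -> ranked passage ids


def score_retriever(
    retriever: Retriever, gold_set: Dict[str, Set[str]], k: int = 20
) -> Dict[str, float]:
    """Average recall@k and precision@k over the labelled queries.

    Assumes at least one query has a non-empty gold set.
    """
    recalls, precisions = [], []
    for query, gold in gold_set.items():
        if not gold:
            continue  # skip queries that have no labelled passages
        hits = len(set(retriever(query)[:k]) & gold)
        recalls.append(hits / len(gold))
        precisions.append(hits / k)
    return {"recall": mean(recalls), "precision": mean(precisions)}


def release_gate(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Return reasons to block the release; an empty list means pass."""
    problems = []
    for name, target in TARGETS.items():
        if candidate[name] < target:
            problems.append(
                f"{name}@k {candidate[name]:.2f} is below target {target:.2f}"
            )
        if candidate[name] < baseline[name]:
            problems.append(
                f"{name}@k regressed from {baseline[name]:.2f} to {candidate[name]:.2f}"
            )
    return problems


# Toy data: two labelled queries and two canned retriever outputs.
gold_set = {"q1": {"a", "b"}, "q2": {"c"}}
baseline_runs = {"q1": ["a", "x"], "q2": ["y", "z"]}
candidate_runs = {"q1": ["a", "b"], "q2": ["c", "z"]}

baseline = score_retriever(lambda q: baseline_runs[q], gold_set, k=2)
candidate = score_retriever(lambda q: candidate_runs[q], gold_set, k=2)

for reason in release_gate(baseline, candidate):
    print("BLOCK:", reason)
```

The same `release_gate` output could feed the drift-watch step: run it on a schedule against fresh traffic samples and forward any non-empty result to whatever alerting channel the team already uses.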

Field results

| Organisation | Metric focus | Outcome |
|---|---|---|
| Morgan Stanley | Precision@20 and groundedness | < 3 % hallucination, adviser adoption 200 users (morganstanley.com, marketsmedia.com) |
| NASA VECTOR | Recall@5, groundedness | Groundedness ↑ 9 pts after rerank, now 97 % (github.com) |

Quick-win pilots (4–6 weeks)

| Pilot | Effort | Proof-point |
|---|---|---|
| P-1 Mini gold set – label 200 queries | 1 wk | Baseline recall and precision |
| P-2 Reranker bake-off – Cohere vs. bge | 1 wk | Choose best scorer |
| P-3 Groundedness harness – JSON rubric + GPT | 1 wk | Auto-citation check |
| P-4 Drift dashboard – Langfuse alerts | 1 wk | Ops sees metric dips first |
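A starting point for the P-3 groundedness harness might look like the sketch below. It makes no model call: it only builds a grading prompt, validates the JSON a judge model would return, and adds a crude token-overlap proxy for the "answer tokens that align to retrieved text" definition used in the scorecard. The rubric fields, the prompt wording and every function name are assumptions made for illustration, not Microsoft's or Pinecone's actual evaluator format.

```python
import json
import re
from typing import Dict, List

# Hypothetical rubric fields, each expected as a score between 0.0 and 1.0.
RUBRIC_KEYS = ("groundedness", "completeness")


def build_judge_prompt(question: str, passages: List[str], answer: str) -> str:
    """Assemble the grading prompt; the wording is illustrative only."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Grade the ANSWER using only the CONTEXT.\n"
        'Reply with JSON: {"groundedness": 0-1, "completeness": 0-1}.\n\n'
        f"QUESTION: {question}\nCONTEXT:\n{context}\nANSWER: {answer}"
    )


def parse_rubric(reply: str) -> Dict[str, float]:
    """Validate the judge's JSON reply and return the scores as floats."""
    data = json.loads(reply)
    scores = {}
    for key in RUBRIC_KEYS:
        value = float(data[key])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} out of range: {value}")
        scores[key] = value
    return scores


def _tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap(answer: str, passages: List[str]) -> float:
    """Lexical proxy: share of answer tokens that occur in any passage."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set(_tokens(" ".join(passages)))
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


# Toy example with an invented passage and a canned judge reply.
passages = ["The launch window opens at 09:00 UTC."]
answer = "The window opens at 09:00 UTC tomorrow."
reply = '{"groundedness": 0.9, "completeness": 1}'

print(build_judge_prompt("When does the window open?", passages, answer))
print(parse_rubric(reply))
print(token_overlap(answer, passages))
```

The token-overlap figure is only a cheap sanity check, since paraphrased but faithful answers will score low on it; the judged rubric is the signal the pilot would actually report.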

References

Morgan Stanley. “AskResearchGPT launch” (Oct 2024). morganstanley.com
Microsoft Learn. “RAG evaluators for relevance, groundedness and completeness” (May 2025). learn.microsoft.com
Pinecone Blog. “RAG Evaluation: Don’t let customers tell you first” (Apr 2025). pinecone.io

BEIR Benchmark – GitHub repo (2025). pinecone.io
Pinecone Docs. “Evaluation overview” (2025). docs.pinecone.io
NASA JPL. “VECTOR: Retrieval-augmented chat for mission ops” (GitHub, 2024). github.com
