RAG Cookbook: RAG Validation with RAGProbe

Tags: RAG, AI, Cookbook, Testing

Automated evaluation techniques for RAG systems using RAGProbe

Introduction

In our previous articles, we implemented a basic RAG pipeline and enhanced it with retrieve-and-rerank capabilities. However, evaluating RAG systems remains challenging due to their complexity and the variety of potential failure points. This article introduces RAGProbe, an automated approach for evaluating RAG applications that was published on 24 September 2024. Later in the article, you will learn why the publication date matters.

The challenge with RAG systems lies in their multi-component nature: embedding generation, retrieval mechanisms, and response generation can each introduce errors. Traditional evaluation methods often miss subtle failures or provide incomplete coverage. RAGProbe addresses these challenges through systematic, automated testing.

To follow along, you should be familiar with the RAG concepts covered in previous articles. The code changes will focus on implementing evaluation scenarios and automated testing while maintaining the same core RAG functionality.

All relevant code changes will be contained in a single commit in the GitHub repository.

Series Overview

This is part of the RAG Cookbook series:

  1. Introduction to RAG
  2. Retrieve and rerank RAG
  3. RAG validation (RAGProbe) (This Article)
  4. Hybrid RAG
  5. Graph RAG
  6. Multi-modal RAG
  7. Agentic RAG (Router)
  8. Agentic RAG (Multi-agent)

RAGProbe Explained

RAGProbe is an automated evaluation framework designed to systematically test RAG pipelines through various scenarios. It builds upon our existing retrieval and reranking system (referenced in the previous article) to provide comprehensive testing and validation.

Why Automated Evaluation

Our current RAG implementation (see chat.ts) handles retrieval and reranking but lacks systematic evaluation. RAGProbe addresses several key challenges:

  1. Retrieval Quality

    • Evaluating context relevance
    • Measuring retrieval precision
    • Assessing chunk selection
  2. Response Accuracy

    • Factual correctness
    • Answer completeness
    • Context utilization
  3. System Performance

    • Response latency
    • Resource utilization
    • Error handling

Evaluation Components

Building on our existing embedding update pipeline (see update-embeddings.ts), RAGProbe adds:

  1. Test Case Generation

    interface TestCase {
      query: string;
      expectedContext: string[];
      expectedAnswer: string;
      metadata: {
        category: string;
        difficulty: string;
        requires_synthesis: boolean;
      };
    }
    
  2. Evaluation Metrics

    interface EvaluationMetrics {
      retrieval: {
        precision: number;
        recall: number;
        relevance_score: number;
      };
      generation: {
        factual_accuracy: number;
        answer_completeness: number;
        context_utilization: number;
      };
      performance: {
        latency_ms: number;
        token_usage: number;
      };
    }
    
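For illustration, a generated test case might look like this (the values below are made up and simply show how the fields fit together):

const exampleTestCase: TestCase = {
  query: "What year was the retrieve-and-rerank pipeline added to the documentation?",
  expectedContext: [
    "The retrieve-and-rerank pipeline was added to the documentation in 2023.",
  ],
  expectedAnswer: "2023",
  metadata: {
    category: "factual",
    difficulty: "easy",
    requires_synthesis: false,
  },
};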

Implementation

The implementation extends our existing RAG system with evaluation capabilities:

1. Test Case Generation

We leverage our document processing pipeline to generate test cases:

async function generateTestCases(documents: Document[]): Promise<TestCase[]> {
  const testCases: TestCase[] = [];

  for (const doc of documents) {
    // Generate factual questions
    const factualQuestions = await generateFactualQuestions(doc);
    testCases.push(...factualQuestions);

    // Generate numerical questions
    const numericalQuestions = await generateNumericalQuestions(doc);
    testCases.push(...numericalQuestions);

    // Generate synthesis questions
    const synthesisQuestions = await generateSynthesisQuestions(doc);
    testCases.push(...synthesisQuestions);
  }

  return testCases;
}
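
The question-generator helpers are not shown above. As a rough sketch, generateFactualQuestions could prompt an LLM to produce grounded question/answer pairs; the OpenAI client, the gpt-4o-mini model name, the JSON response shape, and the doc.content field are assumptions for illustration, not part of the actual commit:

import OpenAI from "openai";

const openai = new OpenAI();

// Sketch only: asks an LLM for factual Q/A pairs answerable from a single document.
// `Document` is assumed to expose a `content` string, as in the rest of the series.
async function generateFactualQuestions(doc: Document): Promise<TestCase[]> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Generate factual question/answer pairs answerable solely from the provided document. " +
          'Reply as JSON: {"questions":[{"query":"...","answer":"...","context":"..."}]}',
      },
      { role: "user", content: doc.content },
    ],
  });

  const parsed = JSON.parse(completion.choices[0].message.content ?? "{}");
  return (parsed.questions ?? []).map(
    (q: { query: string; answer: string; context: string }) => ({
      query: q.query,
      expectedContext: [q.context],
      expectedAnswer: q.answer,
      metadata: { category: "factual", difficulty: "easy", requires_synthesis: false },
    }),
  );
}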

2. Evaluation Pipeline

Building on our reranking implementation:

class RAGProbeEvaluator {
  constructor(private ragSystem: RAGSystem) {}

  async evaluateTestCase(testCase: TestCase): Promise<EvaluationMetrics> {
    const startTime = performance.now();

    // Evaluate retrieval
    const retrievedDocs = await this.ragSystem.retrieve(testCase.query);
    const retrievalMetrics = this.evaluateRetrieval(
      retrievedDocs,
      testCase.expectedContext,
    );

    // Evaluate generation
    const response = await this.ragSystem.generate(
      testCase.query,
      retrievedDocs,
    );
    const generationMetrics = this.evaluateGeneration(
      response,
      testCase.expectedAnswer,
    );

    // Pass the measured wall-clock latency so the performance metrics have real data
    return {
      retrieval: retrievalMetrics,
      generation: generationMetrics,
      performance: this.collectPerformanceMetrics(performance.now() - startTime),
    };
  }
}
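
The evaluateRetrieval helper does the retrieval scoring. A minimal sketch (shown here as a standalone function), assuming retrieved documents expose a content string and treating substring overlap as a stand-in for relevance; a semantic-similarity check would be more robust:

function evaluateRetrieval(
  retrievedDocs: { content: string }[],
  expectedContext: string[],
): { precision: number; recall: number; relevance_score: number } {
  // A retrieved chunk counts as relevant if it contains any expected passage.
  const isRelevant = (doc: { content: string }) =>
    expectedContext.some((ctx) => doc.content.includes(ctx));

  const relevantRetrieved = retrievedDocs.filter(isRelevant).length;
  const coveredExpected = expectedContext.filter((ctx) =>
    retrievedDocs.some((doc) => doc.content.includes(ctx)),
  ).length;

  const precision =
    retrievedDocs.length > 0 ? relevantRetrieved / retrievedDocs.length : 0;
  const recall =
    expectedContext.length > 0 ? coveredExpected / expectedContext.length : 0;
  // The F1 of precision and recall doubles as a single relevance score here.
  const relevance_score =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;

  return { precision, recall, relevance_score };
}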

3. Metrics Collection

We extend our existing metrics collection:

interface MetricsCollector {
  recordRetrieval(metrics: RetrievalMetrics): void;
  recordGeneration(metrics: GenerationMetrics): void;
  recordPerformance(metrics: PerformanceMetrics): void;
  generateReport(): EvaluationReport;
}
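
A minimal in-memory implementation might look like this; the metric and report types below are assumptions derived from the EvaluationMetrics interface shown earlier, not the project's real types:

// Assumed aliases mirroring the EvaluationMetrics sub-objects above.
type RetrievalMetrics = EvaluationMetrics["retrieval"];
type GenerationMetrics = EvaluationMetrics["generation"];
type PerformanceMetrics = EvaluationMetrics["performance"];

// Assumed report shape for this sketch.
interface EvaluationReport {
  totalTestCases: number;
  averagePrecision: number;
  averageFactualAccuracy: number;
  averageLatencyMs: number;
}

class InMemoryMetricsCollector implements MetricsCollector {
  private retrieval: RetrievalMetrics[] = [];
  private generation: GenerationMetrics[] = [];
  private performance: PerformanceMetrics[] = [];

  recordRetrieval(metrics: RetrievalMetrics): void {
    this.retrieval.push(metrics);
  }

  recordGeneration(metrics: GenerationMetrics): void {
    this.generation.push(metrics);
  }

  recordPerformance(metrics: PerformanceMetrics): void {
    this.performance.push(metrics);
  }

  generateReport(): EvaluationReport {
    const mean = (xs: number[]) =>
      xs.length > 0 ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
    return {
      totalTestCases: this.retrieval.length,
      averagePrecision: mean(this.retrieval.map((m) => m.precision)),
      averageFactualAccuracy: mean(this.generation.map((m) => m.factual_accuracy)),
      averageLatencyMs: mean(this.performance.map((m) => m.latency_ms)),
    };
  }
}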

4. Reporting System

The reporting system integrates with our existing logging:

async function generateEvaluationReport(
  testResults: TestResult[],
): Promise<Report> {
  return {
    summary: computeSummaryMetrics(testResults),
    detailed: generateDetailedAnalysis(testResults),
    recommendations: generateRecommendations(testResults),
  };
}
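
As a hypothetical sketch of the summary step, assuming each TestResult pairs a test case with its metrics and a pass/fail flag derived from per-metric thresholds:

interface TestResult {
  testCase: TestCase;
  metrics: EvaluationMetrics;
  passed: boolean;
}

function computeSummaryMetrics(testResults: TestResult[]) {
  const mean = (xs: number[]) =>
    xs.length > 0 ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

  // Group failures by question category to highlight weak spots.
  const failuresByCategory: Record<string, number> = {};
  for (const result of testResults.filter((r) => !r.passed)) {
    const category = result.testCase.metadata.category;
    failuresByCategory[category] = (failuresByCategory[category] ?? 0) + 1;
  }

  return {
    passRate: mean(testResults.map((r) => (r.passed ? 1 : 0))),
    averageRetrievalPrecision: mean(testResults.map((r) => r.metrics.retrieval.precision)),
    averageFactualAccuracy: mean(testResults.map((r) => r.metrics.generation.factual_accuracy)),
    averageLatencyMs: mean(testResults.map((r) => r.metrics.performance.latency_ms)),
    failuresByCategory,
  };
}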

Conclusions

RAGProbe provides systematic validation for our RAG implementation. By integrating automated testing into our pipeline, we can:

  • Continuously monitor retrieval quality
  • Ensure response accuracy
  • Identify performance bottlenecks
  • Guide system improvements

The next article will explore Hybrid RAG architectures, building on this validated foundation.

References