# rageval

TypeScript RAG pipeline evaluation library: the RAGAS-inspired equivalent for Node.js. Evaluate the quality of your Retrieval-Augmented Generation pipeline with LLM-as-judge scoring.

## Quick Start

```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, contextRelevance, answerRelevance } from 'rageval'

const results = await evaluate({
  provider: {
    type: 'anthropic',
    client: new Anthropic(),
    model: 'claude-haiku-4-5-20251001',
  },
  dataset: [
    {
      question: 'What is the capital of France?',
      answer: 'The capital of France is Paris.',
      contexts: ['France is a country in Western Europe. Its capital city is Paris.'],
      groundTruth: 'Paris',
    },
  ],
  metrics: [faithfulness, contextRelevance, answerRelevance],
})
```
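To act on the results, you can aggregate the per-sample scores. The shape used below (a `samples` array, each entry carrying a `scores` record keyed by metric name) is an assumption for illustration, not the documented API; check the library's exported types for the real structure.

```ts
// Continues from the Quick Start. NOTE: this result shape is assumed for
// illustration; adapt the cast to the library's actual types.
type SampleResult = { scores: Record<string, number> }
const { samples } = results as unknown as { samples: SampleResult[] }

// Average each metric across the dataset.
const totals = new Map<string, { sum: number; n: number }>()
for (const { scores } of samples) {
  for (const [metric, score] of Object.entries(scores)) {
    const t = totals.get(metric) ?? { sum: 0, n: 0 }
    t.sum += score
    t.n += 1
    totals.set(metric, t)
  }
}
for (const [metric, { sum, n }] of totals) {
  console.log(`${metric}: ${(sum / n).toFixed(3)}`)
}
```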
## Score Interpretation
All scores are in the range [0, 1], where higher is better:

- `faithfulness`: how well the claims in the answer are supported by the retrieved contexts.
- `contextRelevance`: how relevant the retrieved contexts are to the question.
- `answerRelevance`: how directly the answer addresses the question.
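One way to use these scores in CI is a quality gate that fails the run when an averaged metric drops below a floor. This is a minimal sketch with illustrative thresholds, not library defaults; leave headroom for the ±0.03 noise described under Important Notes below.

```ts
// Illustrative CI gate: throw if any averaged metric falls below its floor.
// The floor values are examples; tune them against your own baseline runs.
const floors: Record<string, number> = {
  faithfulness: 0.8,
  contextRelevance: 0.7,
  answerRelevance: 0.8,
}

export function assertQuality(averages: Record<string, number>): void {
  for (const [metric, floor] of Object.entries(floors)) {
    const score = averages[metric]
    if (score === undefined || score < floor) {
      throw new Error(`${metric} = ${score ?? 'missing'}, below the ${floor} floor`)
    }
  }
}
```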
## Important Notes
Scores are non-deterministic by nature (LLM outputs vary). Treat differences smaller than ±0.03 as noise. Use `temperature: 0` in your provider config for reproducible benchmarks. See the README for full guidance.
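For benchmark runs, that advice translates into a provider config with the temperature pinned. This sketch reuses the Quick Start provider shape; `benchmarkProvider` is just an illustrative name.

```ts
import Anthropic from '@anthropic-ai/sdk'

// Benchmark-ready provider config: temperature pinned to 0, as recommended
// above. Pass this object as `provider` to evaluate().
const benchmarkProvider = {
  type: 'anthropic',
  client: new Anthropic(),
  model: 'claude-haiku-4-5-20251001',
  temperature: 0,
} as const
```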