Parameters: Evaluation configuration. See `EvaluateOptions`.

Returns: A detailed `EvaluationResult` with per-sample and aggregate scores.

Throws: When thresholds are set and one or more metric aggregates fall below their minimum. The thrown error contains the full result, so you can inspect scores even on failure (see the error-handling sketch after the example below).

Throws: On an invalid dataset (empty or wrong shape), an unknown provider, or unrecoverable LLM provider errors.
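The option and result shapes implied by this page can be sketched roughly as follows. Only the fields that appear elsewhere in this document are grounded here; everything else, including the exact field for per-sample scores, is an assumption, so consult the exported `EvaluateOptions` and `EvaluationResult` types for the authoritative definitions.

```ts
// Rough sketch of the shapes implied on this page -- not the library's actual
// declarations. Anything not shown elsewhere in this document is an assumption.
interface EvaluateOptionsSketch {
  provider: { type: string; client: unknown; model: string } // e.g. { type: 'anthropic', ... }
  dataset: Array<{
    question: string
    answer: string
    contexts: string[]
    groundTruth?: string // needed only by metrics such as contextRecall
  }>
  metrics: unknown[] // e.g. [faithfulness, answerRelevance]
  thresholds?: Record<string, number> // minimum acceptable aggregate per metric
  checkpoint?: string // path to a progress file for resumable runs
}

interface EvaluationResultSketch {
  // Aggregate score per metric plus the overall mean, as read via `results.scores`.
  scores: Record<string, number> & { overall: number }
  // Per-sample scores are also returned; their exact field name is not shown on this page.
}
```

The example below shows these options in a complete call.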
```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

const results = await evaluate({
  provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
  dataset: [
    {
      question: 'What is the capital of France?',
      answer: 'The capital of France is Paris.',
      contexts: ['France is a country in Western Europe. Its capital is Paris.'],
    },
  ],
  metrics: [faithfulness, answerRelevance],
  thresholds: { faithfulness: 0.8 },
})

console.log(results.scores)
// { faithfulness: 0.97, answerRelevance: 0.95, overall: 0.96 }
```
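If a threshold is not met, the call rejects instead of resolving. Here is a minimal sketch of recovering the scores from the thrown error; the `result` property used below is an assumption, not confirmed by this page, so check rageval's exported error type before relying on it.

```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

try {
  await evaluate({
    provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
    dataset: [
      {
        question: 'What is the capital of France?',
        answer: 'The capital of France is Berlin.', // unfaithful on purpose, so faithfulness < 0.8
        contexts: ['France is a country in Western Europe. Its capital is Paris.'],
      },
    ],
    metrics: [faithfulness, answerRelevance],
    thresholds: { faithfulness: 0.8 },
  })
} catch (error) {
  // Assumption: the threshold error carries the full EvaluationResult,
  // e.g. as a `result` property. The real property name may differ.
  const result = (error as { result?: { scores?: Record<string, number> } }).result
  if (result?.scores) console.error('Thresholds not met:', result.scores)
  throw error
}
```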
Evaluates the quality of a RAG pipeline against a labelled dataset.
Uses the LLM-as-judge pattern: each metric sends a structured prompt to the chosen LLM, which returns a 0–1 score. Samples are evaluated with bounded concurrency to respect API rate limits.
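"Bounded concurrency" means only a fixed number of judge calls are in flight at any moment. Purely as an illustration of the pattern (this is not rageval's actual implementation, and the helper name is hypothetical):

```ts
// Illustrative worker-pool pattern for bounded concurrency -- not rageval's code.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  async function worker() {
    while (next < items.length) {
      const i = next++ // each worker claims the next unprocessed sample
      results[i] = await fn(items[i]) // e.g. one metric's LLM-as-judge call
    }
  }
  // Start at most `limit` workers; each awaits its current call before taking more work.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => worker()))
  return results
}
```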
Aggregation: For each metric, scores are averaged across all samples for which the metric was computed. The `overall` score is the mean of all metric aggregates. Samples marked as `skipped` by a metric (e.g. samples without `groundTruth` for `contextRecall`) are excluded from that metric's aggregate, preventing silent score distortion.
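A sketch of that aggregation rule, using hypothetical types rather than rageval's internals; it reproduces the numbers from the example above:

```ts
// Illustrative aggregation -- not rageval's code. A per-sample entry is either a
// 0-1 score or 'skipped' when the metric could not be computed for that sample.
type SampleScore = number | 'skipped'

function aggregate(perMetric: Record<string, SampleScore[]>): Record<string, number> {
  const scores: Record<string, number> = {}
  for (const [metric, samples] of Object.entries(perMetric)) {
    const computed = samples.filter((s): s is number => s !== 'skipped')
    // Skipped samples are excluded entirely, so they neither raise nor lower the mean.
    scores[metric] = computed.reduce((sum, s) => sum + s, 0) / computed.length
  }
  const metricMeans = Object.values(scores)
  scores.overall = metricMeans.reduce((sum, s) => sum + s, 0) / metricMeans.length
  return scores
}

// aggregate({ faithfulness: [0.97], answerRelevance: [0.95] })
// -> { faithfulness: 0.97, answerRelevance: 0.95, overall: 0.96 }
```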
Checkpoint/resume: Pass `checkpoint: './progress.json'` to enable resumable evaluation. If interrupted, re-running the same call will skip already-completed samples and continue from where it left off.
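For example, the call from the example above becomes resumable by adding the documented `checkpoint` option:

```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

// Re-running this exact call after an interruption skips the samples already
// recorded in './progress.json' and continues with the remaining ones.
const results = await evaluate({
  provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
  dataset: [
    {
      question: 'What is the capital of France?',
      answer: 'The capital of France is Paris.',
      contexts: ['France is a country in Western Europe. Its capital is Paris.'],
    },
  ],
  metrics: [faithfulness, answerRelevance],
  checkpoint: './progress.json',
})
console.log(results.scores)
```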