Parameters: Evaluation configuration. See `EvaluateOptions`.

Returns: A detailed `EvaluationResult` with per-sample and aggregate scores.

Throws: When thresholds are set and one or more metric aggregates fall below their minimum. The thrown error contains the full result, so you can inspect scores even on failure (see the error-handling sketch after the example below).

Throws: On an invalid dataset (empty or wrong shape), an unknown provider, or unrecoverable LLM provider errors.
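The option and result shapes implied by this page can be sketched roughly as follows. Only the fields that appear elsewhere in this document are grounded here; everything else, including the exact field for per-sample scores, is an assumption, so consult the exported `EvaluateOptions` and `EvaluationResult` types for the authoritative definitions.

```ts
// Rough sketch of the shapes implied on this page -- not the library's actual
// declarations. Anything not shown elsewhere in this document is an assumption.
interface EvaluateOptionsSketch {
  provider: { type: string; client: unknown; model: string } // e.g. { type: 'anthropic', ... }
  dataset: Array<{
    question: string
    answer: string
    contexts: string[]
    groundTruth?: string // needed only by metrics such as contextRecall
  }>
  metrics: unknown[] // e.g. [faithfulness, answerRelevance]
  thresholds?: Record<string, number> // minimum acceptable aggregate per metric
  checkpoint?: string // path to a progress file for resumable runs
}

interface EvaluationResultSketch {
  // Aggregate score per metric plus the overall mean, as read via `results.scores`.
  scores: Record<string, number> & { overall: number }
  // Per-sample scores are also returned; their exact field name is not shown on this page.
}
```

The example below shows these options in a complete call.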
```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

const results = await evaluate({
  provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
  dataset: [
    {
      question: 'What is the capital of France?',
      answer: 'The capital of France is Paris.',
      contexts: ['France is a country in Western Europe. Its capital is Paris.'],
    },
  ],
  metrics: [faithfulness, answerRelevance],
  thresholds: { faithfulness: 0.8 },
})

console.log(results.scores)
// { faithfulness: 0.97, answerRelevance: 0.95, overall: 0.96 }
```
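If a threshold is not met, the call rejects instead of resolving. Here is a minimal sketch of recovering the scores from the thrown error; the `result` property used below is an assumption, not confirmed by this page, so check rageval's exported error type before relying on it.

```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

try {
  await evaluate({
    provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
    dataset: [
      {
        question: 'What is the capital of France?',
        answer: 'The capital of France is Berlin.', // unfaithful on purpose, so faithfulness < 0.8
        contexts: ['France is a country in Western Europe. Its capital is Paris.'],
      },
    ],
    metrics: [faithfulness, answerRelevance],
    thresholds: { faithfulness: 0.8 },
  })
} catch (error) {
  // Assumption: the threshold error carries the full EvaluationResult,
  // e.g. as a `result` property. The real property name may differ.
  const result = (error as { result?: { scores?: Record<string, number> } }).result
  if (result?.scores) console.error('Thresholds not met:', result.scores)
  throw error
}
```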
Evaluates the quality of a RAG pipeline against a labelled dataset.
Uses the LLM-as-judge pattern: each metric sends a structured prompt to the chosen LLM, which returns a 0–1 score. Samples are evaluated with bounded concurrency to respect API rate limits.
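"Bounded concurrency" means only a fixed number of judge calls are in flight at any moment. Purely as an illustration of the pattern (this is not rageval's actual implementation, and the helper name is hypothetical):

```ts
// Illustrative worker-pool pattern for bounded concurrency -- not rageval's code.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  async function worker() {
    while (next < items.length) {
      const i = next++ // each worker claims the next unprocessed sample
      results[i] = await fn(items[i]) // e.g. one metric's LLM-as-judge call
    }
  }
  // Start at most `limit` workers; each awaits its current call before taking more work.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => worker()))
  return results
}
```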
Aggregation: For each metric, scores are averaged across all samples for which the metric was computed. The `overall` score is the mean of all metric aggregates. Samples marked as `skipped` by a metric (e.g. samples without `groundTruth` for `contextRecall`) are excluded from that metric's aggregate, preventing silent score distortion.
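A sketch of that aggregation rule, using hypothetical types rather than rageval's internals; it reproduces the numbers from the example above:

```ts
// Illustrative aggregation -- not rageval's code. A per-sample entry is either a
// 0-1 score or 'skipped' when the metric could not be computed for that sample.
type SampleScore = number | 'skipped'

function aggregate(perMetric: Record<string, SampleScore[]>): Record<string, number> {
  const scores: Record<string, number> = {}
  for (const [metric, samples] of Object.entries(perMetric)) {
    const computed = samples.filter((s): s is number => s !== 'skipped')
    // Skipped samples are excluded entirely, so they neither raise nor lower the mean.
    scores[metric] = computed.reduce((sum, s) => sum + s, 0) / computed.length
  }
  const metricMeans = Object.values(scores)
  scores.overall = metricMeans.reduce((sum, s) => sum + s, 0) / metricMeans.length
  return scores
}

// aggregate({ faithfulness: [0.97], answerRelevance: [0.95] })
// -> { faithfulness: 0.97, answerRelevance: 0.95, overall: 0.96 }
```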
Checkpoint/resume: Pass `checkpoint: './progress.json'` to enable resumable evaluation. If interrupted, re-running the same call will skip already-completed samples and continue from where it left off.
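For example, the call from the example above becomes resumable by adding the documented `checkpoint` option:

```ts
import Anthropic from '@anthropic-ai/sdk'
import { evaluate, faithfulness, answerRelevance } from 'rageval'

// Re-running this exact call after an interruption skips the samples already
// recorded in './progress.json' and continues with the remaining ones.
const results = await evaluate({
  provider: { type: 'anthropic', client: new Anthropic(), model: 'claude-haiku-4-5-20251001' },
  dataset: [
    {
      question: 'What is the capital of France?',
      answer: 'The capital of France is Paris.',
      contexts: ['France is a country in Western Europe. Its capital is Paris.'],
    },
  ],
  metrics: [faithfulness, answerRelevance],
  checkpoint: './progress.json',
})
console.log(results.scores)
```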