rageval - v0.1.1

    Interface EvaluateOptions

    Configuration options for evaluate.

    interface EvaluateOptions {
        provider: ProviderConfig;
        dataset: {
            id?: string;
            question: string;
            answer: string;
            contexts: string[];
            groundTruth?: string;
            tenantId?: string;
            metadata?: Record<string, unknown>;
        }[];
        metrics?: Metric[];
        includeReasoning?: boolean;
        concurrency?: number;
        thresholds?: Partial<Record<string, number>>;
        onProgress?: (completed: number, total: number) => void;
        checkpoint?: string;
    }

    Properties

    provider: ProviderConfig

    The LLM provider to use as the judge. Pass { type, client, model }, where type is 'anthropic', 'openai', or 'azure'.
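
    A minimal sketch of an Anthropic judge configuration, assuming the official @anthropic-ai/sdk client; the model name is illustrative, not a library default.

    import Anthropic from '@anthropic-ai/sdk';

    // The SDK reads ANTHROPIC_API_KEY from the environment by default.
    const client = new Anthropic();

    const provider = {
      type: 'anthropic' as const,
      client,
      model: 'claude-3-5-sonnet-latest', // illustrative model id
    };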

    dataset: {
        id?: string;
        question: string;
        answer: string;
        contexts: string[];
        groundTruth?: string;
        tenantId?: string;
        metadata?: Record<string, unknown>;
    }[]

    Array of RAG samples to evaluate. Each sample must have question, answer, and contexts. groundTruth is optional but required for the contextRecall metric. tenantId and metadata are optional and propagate to per-sample results.
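
    An illustrative two-sample dataset. groundTruth is supplied only on the first sample, so only that sample participates in contextRecall.

    const dataset = [
      {
        id: 'q-001',
        question: 'What is the capital of France?',
        answer: 'Paris is the capital of France.',
        contexts: ['Paris is the capital and largest city of France.'],
        groundTruth: 'Paris', // enables contextRecall for this sample
      },
      {
        question: 'Who wrote Dune?',
        answer: 'Frank Herbert wrote Dune.',
        contexts: ['Dune is a 1965 novel by Frank Herbert.'],
        // no groundTruth: skipped by contextRecall
      },
    ];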

    metrics?: Metric[]

    Which metrics to compute. Defaults to all five built-in metrics.

    Available: faithfulness, contextRelevance, answerRelevance, contextRecall, contextPrecision.

    Note: contextRecall requires groundTruth on each sample. Samples without groundTruth are automatically skipped for that metric and excluded from its aggregate score.
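
    For example, to compute only two of the five metrics (this sketch assumes Metric values are the metric names listed above):

    await evaluate({
      provider,
      dataset,
      metrics: ['faithfulness', 'answerRelevance'],
    });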

    includeReasoning?: boolean

    When true, each metric's LLM reasoning is included in sample results. Useful for debugging unexpected scores.

    Default: false
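
    A sketch of debugging a surprising score with reasoning enabled. The result fields shown are assumptions for illustration; see SampleResult for the actual shape.

    const result = await evaluate({ provider, dataset, includeReasoning: true });

    // Hypothetical field names; consult SampleResult for the real ones.
    for (const sample of result.samples) {
      console.log(sample.question, sample.scores?.faithfulness?.reasoning);
    }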
    
    concurrency?: number

    Maximum number of samples evaluated simultaneously. Higher values are faster but consume more API quota.

    Default: 5
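
    For example, to reduce API pressure when rate limits are tight:

    // Two samples in flight at a time instead of the default five.
    await evaluate({ provider, dataset, concurrency: 2 });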
    
    thresholds?: Partial<Record<string, number>>

    Minimum acceptable score per metric. If any aggregate score falls below its threshold after evaluation, a ThresholdError is thrown containing the full result.

    This is intended for CI quality gates — use it in combination with process.exit(1) to fail a build when RAG quality regresses.

    thresholds: { faithfulness: 0.8, answerRelevance: 0.75 }
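
    A sketch of a CI quality gate built on the fragment above, assuming evaluate and ThresholdError are both exported by rageval:

    import { evaluate, ThresholdError } from 'rageval';

    try {
      await evaluate({
        provider,
        dataset,
        thresholds: { faithfulness: 0.8, answerRelevance: 0.75 },
      });
    } catch (err) {
      if (err instanceof ThresholdError) {
        console.error('RAG quality regressed:', err.message);
        process.exit(1); // fail the build
      }
      throw err;
    }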
    
    onProgress?: (completed: number, total: number) => void

    Called after each sample completes evaluation. Use for progress bars, logging, or UI updates during large evaluations.

    Type Declaration

      • (completed: number, total: number): void
      • Parameters

        • completed: number

          Number of samples evaluated so far.

        • total: number

          Total number of samples in the dataset.

        Returns void

    onProgress: (done, total) => {
      process.stderr.write(`\r${done}/${total} evaluated`)
    }

    checkpoint?: string

    File path for checkpoint-based resumable evaluation.

    When provided, evaluate() will:

    1. On start — read the checkpoint file if it exists, and skip any samples whose results are already recorded (matched by id if present, otherwise by question text). This lets you resume a large batch that was interrupted.
    2. After each new sample — write the accumulated results (prior + new) to the checkpoint file as JSON so progress is never lost.

    The checkpoint file is a plain JSON file with the shape:

    { "version": 1, "samples": [ ...SampleResult[] ] }
    

    Delete the checkpoint file when you want to start a fresh evaluation.
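
    Because the checkpoint is plain JSON with the shape above, partial progress can be inspected directly:

    import { readFileSync } from 'node:fs';

    const ckpt = JSON.parse(readFileSync('./eval-progress.json', 'utf8'));
    console.log(`${ckpt.samples.length} samples recorded (version ${ckpt.version})`);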

    // Large 500-sample evaluation — safe to Ctrl+C and restart
    await evaluate({
      provider: { type: 'anthropic', client },
      dataset: largeDataset,
      checkpoint: './eval-progress.json',
      onProgress: (done, total) => process.stderr.write(`\r${done}/${total}`),
    })