Configuration options for evaluate.

The LLM provider to use as the judge.
Pass { type: 'anthropic', client, model }, { type: 'openai', client, model },
or { type: 'azure', client, model }.
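For example, a provider built with the official Anthropic SDK. This is a sketch: the model ID is illustrative, and the client construction is just one way to obtain a client object.

import Anthropic from "@anthropic-ai/sdk";

const provider = {
  type: "anthropic" as const,
  client: new Anthropic(), // reads ANTHROPIC_API_KEY from the environment
  model: "claude-3-5-sonnet-latest", // any judge-capable model ID
};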
Array of RAG samples to evaluate.
Each sample must have question, answer, and contexts.
groundTruth is optional but required for the contextRecall metric.
tenantId and metadata are optional and propagate to per-sample results.
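A minimal dataset might look like the following sketch; the values are illustrative, and only question, answer, and contexts are required.

const samples = [
  {
    id: "q-001", // optional; used for checkpoint matching
    question: "What is the refund window?",
    answer: "Refunds are accepted within 30 days of purchase.",
    contexts: [
      "Our refund policy allows returns within 30 days of purchase.",
    ],
    groundTruth: "Customers may request a refund within 30 days.", // needed for contextRecall
    tenantId: "acme",            // optional, propagated to per-sample results
    metadata: { source: "faq" }, // optional, propagated to per-sample results
  },
];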
Optional metrics: Which metrics to compute. Defaults to all five built-in metrics.
Available: faithfulness, contextRelevance, answerRelevance,
contextRecall, contextPrecision.
Note: contextRecall requires groundTruth on each sample.
Samples without groundTruth are automatically skipped for that metric
and excluded from its aggregate score.
Optional include: When true, each metric's LLM reasoning is included in sample results.
Useful for debugging unexpected scores.
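For example, computing only the metrics that don't need groundTruth and attaching the judge's reasoning. This is a sketch: it assumes metrics are selected by name, uses the option names shown above, and imports evaluate from a placeholder package name.

import { evaluate } from "rag-evals"; // placeholder package name

// provider and samples as defined in the earlier snippets
const result = await evaluate({
  provider,
  samples,
  metrics: ["faithfulness", "contextRelevance", "answerRelevance"],
  include: true, // attach each metric's LLM reasoning to sample results
});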
Optional concurrency: Maximum number of samples evaluated simultaneously. Higher values are faster but consume more API quota.
Optional thresholds: Minimum acceptable score per metric. If any aggregate score falls below its threshold after evaluation, a ThresholdError is thrown containing the full result.
This is intended for CI quality gates — use it in combination with
process.exit(1) to fail a build when RAG quality regresses.
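A CI gate could look like the following sketch. It assumes ThresholdError is exported alongside evaluate; the threshold values and package name are illustrative.

import { evaluate, ThresholdError } from "rag-evals"; // placeholder package name

try {
  await evaluate({
    provider,
    samples,
    thresholds: { faithfulness: 0.8, answerRelevance: 0.75 },
  });
} catch (err) {
  if (err instanceof ThresholdError) {
    console.error("RAG quality regressed:", err.message);
    process.exit(1); // fail the build
  }
  throw err;
}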
Optional on: Called after each sample completes evaluation, with the number of samples evaluated so far and the total number of samples in the dataset. Use for progress bars, logging, or UI updates during large evaluations.
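For example, a simple console progress log. This sketch uses the option names shown above; the concurrency value is illustrative.

await evaluate({
  provider,
  samples,
  concurrency: 4, // evaluate up to 4 samples at a time
  on: (completed: number, total: number) => {
    console.log(`Evaluated ${completed}/${total} samples`);
  },
});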
Optional checkpoint: File path for checkpoint-based resumable evaluation.
When provided, evaluate() saves completed sample results to this file and skips
samples that already have a saved result (matched by id if present, otherwise
by question text). This lets you resume a large batch that was interrupted.
The checkpoint file is a plain JSON file with the shape:
{ "version": 1, "samples": [ ...SampleResult[] ] }
Delete the checkpoint file when you want to start a fresh evaluation.
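For example, with an arbitrary path; re-running the same call resumes from the file.

await evaluate({
  provider,
  samples,
  checkpoint: "./eval-checkpoint.json", // created if missing, reused on re-run
});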