The evaluation result from evaluate().
Aggregate scores averaged across all samples.
Per-sample detailed results.
Optional. Per-metric score distribution statistics (min, max, stddev, count).
Keys are metric names (the same keys as in scores, minus overall).
Useful for understanding score variance and identifying which metrics
score poorly. overall is excluded; compute it from the individual metric stats.
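For example, you can scan the record for the metric with the widest score spread. A minimal sketch, assuming stats maps metric names to plain { min, max, stddev, count } objects (the actual type name is not given here):

```ts
// Assumed shape, inferred from the field description above.
interface MetricStats {
  min: number;
  max: number;
  stddev: number;
  count: number;
}

// Return the name of the metric with the largest standard deviation,
// i.e. the noisiest metric and a good place to start debugging.
function noisiestMetric(stats: Record<string, MetricStats>): string | undefined {
  let worst: string | undefined;
  let worstStddev = -Infinity;
  for (const [metric, s] of Object.entries(stats)) {
    if (s.stddev > worstStddev) {
      worst = metric;
      worstStddev = s.stddev;
    }
  }
  return worst;
}
```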
Metadata about the evaluation run.
Total number of samples evaluated.
Names of the metrics that were evaluated.
LLM provider used (e.g. 'anthropic', 'openai').
LLM model used (e.g. 'claude-opus-4-6').
ISO 8601 timestamp when evaluation started.
ISO 8601 timestamp when evaluation completed.
Wall-clock duration of the evaluation in milliseconds.
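Taken together, the fields above suggest a shape roughly like the following. This is a sketch only: the nested property names (sampleCount, provider, and so on) are assumptions rather than the library's documented identifiers, and MetricStats is the assumed shape sketched earlier.

```ts
interface EvaluationResult {
  scores: Record<string, number>;       // aggregate scores, averaged across all samples
  samples: unknown[];                   // per-sample detailed results
  stats?: Record<string, MetricStats>;  // per-metric distributions; no 'overall' key
  metadata: {
    sampleCount: number;                // total number of samples evaluated
    metrics: string[];                  // names of the evaluated metrics
    provider: string;                   // e.g. 'anthropic', 'openai'
    model: string;                      // e.g. 'claude-opus-4-6'
    startedAt: string;                  // ISO 8601, evaluation start
    completedAt: string;                // ISO 8601, evaluation completion
    durationMs: number;                 // wall-clock duration in milliseconds
  };
}
```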
Score below which a sample is flagged. Default: 0.6.
SARIF 2.1.0 JSON string.
Serializes an EvaluationResult to SARIF 2.1.0 format.
SARIF (Static Analysis Results Interchange Format) is the standard used by GitHub Advanced Security, Azure DevOps, and other code-quality tools. Upload the SARIF file to GitHub to see evaluation failures as code-scanning alerts on your pull requests, directly in the diff.
Each sample that scores below failureThreshold on any metric becomes a SARIF "result" with severity "warning" (score < threshold) or "error" (score < 0.4). Samples that pass all thresholds produce no SARIF results.
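In CI, the usual flow is to serialize the result and write it to disk for an upload step to pick up. A sketch, assuming the serializer is exported as toSarif with a positional failureThreshold parameter, and with a placeholder package name; check the library's actual exports:

```ts
import { writeFileSync } from 'node:fs';
// 'evaluate' and 'toSarif' are assumed export names; 'my-eval-lib' is a placeholder.
import { evaluate, toSarif } from 'my-eval-lib';

const result = await evaluate(/* ...your samples and metrics... */);

// Flag anything under 0.7 instead of the default 0.6.
const sarif = toSarif(result, 0.7 /* failureThreshold */);

// Write the SARIF where a CI upload step (e.g. the
// github/codeql-action/upload-sarif action) can find it.
writeFileSync('eval-results.sarif', sarif);
```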