The evaluation result from evaluate().
Aggregate scores averaged across all samples.
Per-sample detailed results.
Per-metric score distribution statistics (min, max, stddev, count).
Keys are metric names (same as keys in scores, minus overall).
Useful for understanding score variance and identifying which questions
score poorly. overall is excluded — compute it from individual metric stats.
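As an illustration of that use, the sketch below ranks metrics by standard deviation to surface the noisiest one. The `MetricStats` shape and the helper name are assumptions based on the field list above, not part of the documented API.

```ts
// Hypothetical shape for one metric's stats, matching the fields listed above.
interface MetricStats {
  min: number;
  max: number;
  stddev: number;
  count: number;
}

// Return the metric whose scores vary the most across samples.
function noisiestMetric(stats: Record<string, MetricStats>): string | undefined {
  let noisiest: string | undefined;
  let highest = -Infinity;
  for (const [metric, s] of Object.entries(stats)) {
    if (s.stddev > highest) {
      highest = s.stddev;
      noisiest = metric;
    }
  }
  return noisiest;
}
```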
Metadata about the evaluation run.
Total number of samples evaluated.
Names of the metrics that were evaluated.
LLM provider used (e.g. 'anthropic', 'openai').
LLM model used (e.g. 'claude-opus-4-6').
ISO 8601 timestamp when evaluation started.
ISO 8601 timestamp when evaluation completed.
Wall-clock duration of the evaluation in milliseconds.
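Putting the fields above together, the result might be typed roughly as follows. Only `scores`, `stats`, and the `overall` key are named in these docs; every other property name in the sketch is an illustrative guess.

```ts
// Rough sketch of the result returned by evaluate(); property names other than
// `scores` and `stats` are assumed for illustration.
interface EvaluationResultSketch {
  /** Aggregate scores averaged across all samples, keyed by metric name plus 'overall'. */
  scores: Record<string, number>;
  /** Per-sample detailed results; the element shape depends on your metrics. */
  samples: unknown[];
  /** Per-metric distribution statistics; the 'overall' key is excluded. */
  stats?: Record<string, { min: number; max: number; stddev: number; count: number }>;
  /** Metadata about the evaluation run. */
  metadata: {
    totalSamples: number; // total number of samples evaluated (assumed name)
    metrics: string[];    // names of the evaluated metrics (assumed name)
    provider: string;     // e.g. 'anthropic', 'openai' (assumed name)
    model: string;        // e.g. 'claude-opus-4-6' (assumed name)
    startedAt: string;    // ISO 8601 start timestamp (assumed name)
    completedAt: string;  // ISO 8601 completion timestamp (assumed name)
    durationMs: number;   // wall-clock duration in milliseconds (assumed name)
  };
}
```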
Optional report title shown in the header.
Self-contained HTML string.
Generates a self-contained, zero-dependency HTML evaluation report.
The returned string is a complete HTML document that can be saved to a
.html file and opened in any browser, with no server and no internet connection required.
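A possible end-to-end usage is sketched below, assuming the report generator is exported as `generateHtmlReport(result, options)` with an optional `title`. The function names and signatures are not given in these docs, so treat the declarations as placeholders for the real exports.

```ts
import { writeFileSync } from 'node:fs';

// Placeholder declarations: substitute the actual import path and export names.
declare function evaluate(config: unknown): Promise<unknown>;
declare function generateHtmlReport(result: unknown, options?: { title?: string }): string;

async function main() {
  const result = await evaluate({ /* dataset, metrics, provider config */ });

  // Optional title appears in the report header.
  const html = generateHtmlReport(result, { title: 'Nightly evaluation' });

  // The returned string is a complete HTML document: write it out and open it
  // directly in a browser, no server needed.
  writeFileSync('eval-report.html', html);
}

main().catch(console.error);
```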