The evaluation result from evaluate().
Aggregate scores averaged across all samples.
Per-sample detailed results.
Optional per-metric score distribution statistics (min, max, stddev, count).
Keys are metric names (the same keys as in scores, excluding overall; compute
overall stats from the individual metric stats if needed). Useful for
understanding score variance and identifying metrics on which some questions score poorly.
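As a rough sketch, the result might be typed as below. Only scores, stats, overall, and the min/max/stddev/count fields are confirmed by the documentation above; the interface names and the samples shape are illustrative assumptions, and metadata is covered in the next section.

```ts
// Illustrative sketch only; names not documented above are assumptions.
interface MetricStats {
  min: number;     // lowest per-sample score for this metric
  max: number;     // highest per-sample score for this metric
  stddev: number;  // standard deviation of scores across samples
  count: number;   // number of samples scored on this metric
}

interface EvaluationResult {
  // Aggregate scores averaged across all samples, keyed by metric name plus "overall".
  scores: Record<string, number>;
  // Per-sample detailed results (shape assumed here).
  samples: Array<{ scores: Record<string, number> }>;
  // Optional per-metric distribution stats; keys match `scores` minus "overall".
  stats?: Record<string, MetricStats>;
}
```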
Metadata about the evaluation run.
Total number of samples evaluated.
Names of the metrics that were evaluated.
LLM provider used (e.g. 'anthropic', 'openai').
LLM model used (e.g. 'claude-opus-4-6').
ISO 8601 timestamp when evaluation started.
ISO 8601 timestamp when evaluation completed.
Wall-clock duration of the evaluation in milliseconds.
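A corresponding sketch of the run metadata; the field names are assumptions chosen to mirror the descriptions above, not the library's confirmed API.

```ts
// Assumed field names for the evaluation run metadata described above.
interface EvaluationMetadata {
  sampleCount: number;  // total number of samples evaluated
  metrics: string[];    // names of the metrics that were evaluated
  provider: string;     // LLM provider, e.g. 'anthropic' or 'openai'
  model: string;        // LLM model, e.g. 'claude-opus-4-6'
  startedAt: string;    // ISO 8601 timestamp when evaluation started
  completedAt: string;  // ISO 8601 timestamp when evaluation completed
  durationMs: number;   // wall-clock duration in milliseconds
}
```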
Optional display configuration.
Prints a formatted evaluation report to the terminal.
Renders color-coded score bars, metric summaries, and an optional per-sample breakdown. Colors are automatically disabled when stdout is not a TTY.
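A hypothetical end-to-end usage example, assuming the module exports evaluate() and a printReport() helper that accepts the display configuration; the import path, option names, and sample shape are illustrative.

```ts
// Illustrative usage; the import path, evaluate() options, and printReport()
// display options are assumptions, not the library's confirmed signatures.
import { evaluate, printReport } from './evaluation';

const result = await evaluate({
  samples: [
    { question: 'What is 2 + 2?', expected: '4' },  // sample shape is assumed
  ],
  metrics: ['accuracy', 'relevance'],
  provider: 'anthropic',
  model: 'claude-opus-4-6',
});

// Prints color-coded score bars and metric summaries; colors are dropped
// automatically when stdout is not a TTY.
printReport(result, { perSample: true });  // `perSample` option name is assumed
```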