score
The metric score for this sample, in the range [0.0, 1.0]. Higher is always better. Scores are clamped to [0, 1] even if the LLM returns values outside that range.
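The clamping rule can be sketched as follows; clampScore is a hypothetical helper for illustration, not part of the library's API:

```typescript
// Sketch of the clamping described above: a raw LLM score outside
// [0, 1] is clipped into range before being reported.
const clampScore = (raw: number): number => Math.min(1, Math.max(0, raw));

console.log(clampScore(1.4));  // 1
console.log(clampScore(-0.2)); // 0
console.log(clampScore(0.7));  // 0.7
```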
Optional reasoning
The LLM judge's explanation of why it assigned this score. Only populated when includeReasoning: true is passed to evaluate(). Useful for debugging unexpectedly low or high scores.
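Since reasoning is only present when includeReasoning: true was passed, a debugging pass over outlier scores might look like this sketch; explainOutliers, its thresholds, and the inline type are assumptions for illustration, not the library's API:

```typescript
// Hypothetical helper: collect the judge's explanations for scores that
// look suspiciously low or high. Field names match this page; the
// inline object type is an assumption, not the library's declaration.
function explainOutliers(
  outputs: { score: number; reasoning?: string; skipped?: boolean }[],
  low = 0.3,
  high = 0.9,
): string[] {
  return outputs
    .filter((o) => !o.skipped && (o.score < low || o.score > high))
    .map((o) => o.reasoning ?? "(no reasoning captured)");
}

console.log(explainOutliers([
  { score: 0.1, reasoning: "Answer contradicts the retrieved context." },
  { score: 0.55 },
]));
```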
Optional skipped
When true, this metric could not be computed for this sample and should be excluded from all aggregates. The most common cause is contextRecall being evaluated on a sample without a groundTruth field. evaluate() detects skipped: true and omits the score from both the per-sample scores and the per-metric aggregate; it is never counted as a 0. This prevents silent score distortion. The score field is still set to 0 for backward compatibility with code that reads raw MetricOutput without checking skipped.
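The exclusion rule can be sketched as follows: aggregate is a hypothetical illustration of the behavior described above, not the library's actual implementation, and the object shape is inferred from this page.

```typescript
// Sketch of the aggregation rule described above: outputs with
// skipped: true are excluded from the mean entirely, never counted as 0.
function aggregate(
  outputs: { score: number; skipped?: boolean }[],
): number | undefined {
  const counted = outputs.filter((o) => !o.skipped);
  if (counted.length === 0) return undefined; // nothing computable
  return counted.reduce((sum, o) => sum + o.score, 0) / counted.length;
}

const perMetric = [
  { score: 0.5 },
  { score: 0, skipped: true }, // score is 0, but must not drag the mean down
  { score: 0.75 },
];
console.log(aggregate(perMetric)); // 0.625, not 1.25 / 3
```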
Output from a single metric evaluation on one sample. Returned by every metric's score() method and collected by evaluate() into the final EvaluationResult.
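Putting the three fields together, the shape might be declared roughly like this; it is a reconstruction from the property descriptions above, and the library's actual declaration may differ:

```typescript
// Hypothetical reconstruction of MetricOutput from this page's docs.
interface MetricOutput {
  /** Clamped to [0, 1]; higher is better. */
  score: number;
  /** Set only when includeReasoning: true is passed to evaluate(). */
  reasoning?: string;
  /** True when the metric could not be computed for this sample. */
  skipped?: boolean;
}

const out: MetricOutput = {
  score: 0.82,
  reasoning: "The answer is faithful to the retrieved context.",
};
console.log(out.score);
```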