Aggregate scores averaged across all samples.
Per-sample detailed results.
Optional stats?: Record<…>
Per-metric score distribution statistics (min, max, stddev, count).
Keys are metric names (same as keys in scores, minus overall).
Useful for understanding score variance and identifying which questions
score poorly. overall is excluded — compute it from individual metric stats.
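A minimal sketch of consuming these stats, assuming each entry has the { min, max, stddev, count } shape described above; the MetricStats interface and highVarianceMetrics helper are illustrative, not library exports:

```typescript
// Hypothetical per-metric stats entry shape, as described above.
interface MetricStats {
  min: number;
  max: number;
  stddev: number;
  count: number;
}

// Flag metrics whose scores vary widely across samples (high stddev),
// which often points at specific questions scoring poorly.
function highVarianceMetrics(
  stats: Record<string, MetricStats>,
  maxStddev = 0.15
): string[] {
  return Object.entries(stats)
    .filter(([, s]) => s.stddev > maxStddev)
    .map(([name]) => name);
}
```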
Metadata about the evaluation run.
Total number of samples evaluated.
Names of the metrics that were evaluated.
LLM provider used (e.g. 'anthropic', 'openai').
LLM model used (e.g. 'claude-opus-4-6').
ISO 8601 timestamp when evaluation started.
ISO 8601 timestamp when evaluation completed.
Wall-clock duration of the evaluation in milliseconds.
Readonly failures
Map of metric names to their actual score and required minimum. Only metrics that failed the threshold are included.
Iterate with Object.entries(e.failures) to get [metric, { score, threshold }] pairs.
Readonly result
The complete EvaluationResult that triggered this error.
All per-sample scores and aggregate scores are present — only the threshold gate failed. Use this to export reports (SARIF, JUnit, HTML, Markdown) even when the quality gate fails, so you can diagnose exactly which samples caused the regression.
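The iteration described above can be sketched as follows; the Failures type alias and reportFailures helper are illustrative assumptions, not library exports:

```typescript
// Hypothetical shape of the failures map: metric name -> actual score
// and the minimum it was required to meet.
type Failures = Record<string, { score: number; threshold: number }>;

// Turn each failed metric into a human-readable line for CI logs.
function reportFailures(failures: Failures): string[] {
  return Object.entries(failures).map(
    ([metric, { score, threshold }]) =>
      `${metric}: scored ${score.toFixed(2)}, required ${threshold.toFixed(2)}`
  );
}
```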
Static stackTraceLimit
The Error.stackTraceLimit property specifies the number of stack frames
collected by a stack trace (whether generated by new Error().stack or
Error.captureStackTrace(obj)).
The default value is 10 but may be set to any valid JavaScript number. Changes
will affect any stack trace captured after the value has been changed.
If set to a non-number value, or set to a negative number, stack traces will not capture any frames.
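For instance, a short sketch of temporarily lowering the limit and restoring it (standard Node.js/V8 behavior):

```typescript
// Temporarily capture at most 2 frames, then restore the previous limit.
const saved = Error.stackTraceLimit;
Error.stackTraceLimit = 2;
const err = new Error("boom");
Error.stackTraceLimit = saved;

// The first line of .stack is the message; the remaining lines are frames.
const frames = (err.stack ?? "").split("\n").slice(1);
console.log(frames.length); // at most 2
```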
Optional cause
Optional stack
Static captureStackTrace
Creates a .stack property on targetObject, which when accessed returns
a string representing the location in the code at which
Error.captureStackTrace() was called.
const myObject = {};
Error.captureStackTrace(myObject);
myObject.stack; // Similar to `new Error().stack`
The first line of the trace will be prefixed with
${myObject.name}: ${myObject.message}.
The optional constructorOpt argument accepts a function. If given, all frames
above constructorOpt, including constructorOpt, will be omitted from the
generated stack trace.
The constructorOpt argument is useful for hiding implementation
details of error generation from the user. For instance:
function a() {
  b();
}

function b() {
  c();
}

function c() {
  // Create an error without stack trace to avoid calculating the stack trace twice.
  const { stackTraceLimit } = Error;
  Error.stackTraceLimit = 0;
  const error = new Error();
  Error.stackTraceLimit = stackTraceLimit;

  // Capture the stack trace above function b
  Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace
  throw error;
}

a();
Optional constructorOpt: Function
Static prepareStackTrace
Thrown by evaluate when one or more metric aggregate scores fall below their configured ScoreThresholds.
Carries both the failing metric details (failures) and the full EvaluationResult (result) so you can export SARIF, JUnit, or HTML reports even when the quality gate fails.
Use this in CI pipelines to fail a build when RAG quality regresses:
Example
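A hedged sketch of such a CI gate. The class name EvaluationThresholdError and the runGate helper below are assumptions for illustration; catch the library's actual exported error class in real code:

```typescript
// Hypothetical stand-in for the library's threshold error described above.
type Failures = Record<string, { score: number; threshold: number }>;

class EvaluationThresholdError extends Error {
  constructor(readonly failures: Failures, readonly result: unknown) {
    super("RAG quality gate failed");
  }
}

// Run an evaluation and turn a threshold failure into a nonzero exit code,
// logging each failing metric so CI output shows exactly what regressed.
function runGate(run: () => void): number {
  try {
    run();
    return 0;
  } catch (e) {
    if (e instanceof EvaluationThresholdError) {
      for (const [metric, { score, threshold }] of Object.entries(e.failures)) {
        console.error(`${metric}: ${score} < ${threshold}`);
      }
      // e.result still holds the full EvaluationResult for report export.
      return 1;
    }
    throw e;
  }
}
```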