rageval - v0.1.1

    Function toCsv

    • Serializes an EvaluationResult to a CSV string.

      Each row represents one sample. Columns:

      • id -- sample identifier (empty string if not set)
      • question -- the question text
      • one numeric score column per evaluated metric (e.g. faithfulness, answerRelevance)
      • overall -- per-sample mean of all metric scores
      • {metric}_reasoning columns -- included automatically when includeReasoning: true was passed to evaluate() and reasoning text is present. Useful for audit logs in healthcare, legal, or compliance contexts.

      Scores are formatted to 4 decimal places. The CSV follows RFC 4180 escaping -- fields containing commas, double quotes, or newlines are wrapped in double quotes with internal quotes doubled. Safe to open in Excel, Google Sheets, or pandas.
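
      For reference, the quoting rule can be expressed as a small standalone helper. This is a sketch of the RFC 4180 behaviour described above, not rageval's internal implementation:

      // Illustrative only -- mirrors the escaping rule, not rageval's source.
      function escapeCsvField(field: string): string {
        // Quote only when the field contains a comma, double quote, or newline.
        if (/[",\r\n]/.test(field)) {
          return '"' + field.replace(/"/g, '""') + '"' // double internal quotes
        }
        return field
      }

      escapeCsvField('plain text')   // plain text
      escapeCsvField('a, b')         // "a, b"
      escapeCsvField('say "hi"')     // "say ""hi"""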

      Parameters

      • result: {
            scores: {
                faithfulness?: number;
                contextRelevance?: number;
                answerRelevance?: number;
                contextRecall?: number;
                contextPrecision?: number;
                overall: number;
                [key: string]: unknown;
            };
            samples: {
                id?: string;
                question: string;
                scores: Record<string, number>;
                reasoning?: Record<string, string>;
                tenantId?: string;
                metadata?: Record<string, unknown>;
            }[];
            stats?: Record<
                string,
                { mean: number; min: number; max: number; stddev: number; count: number },
            >;
            meta: {
                totalSamples: number;
                metrics: string[];
                provider: string;
                model: string;
                startedAt: string;
                completedAt: string;
                durationMs: number;
            };
        }

        The evaluation result to export.

        • scores: {
              faithfulness?: number;
              contextRelevance?: number;
              answerRelevance?: number;
              contextRecall?: number;
              contextPrecision?: number;
              overall: number;
              [key: string]: unknown;
          }

          Aggregate scores averaged across all samples.
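
          For example, scores.overall can serve as a simple CI quality gate. A minimal sketch -- the 0.8 threshold is an arbitrary illustration, not a library default:

          const { scores } = await evaluate({ ... })
          // Fail the build when the run-level mean drops below the chosen bar:
          if (scores.overall < 0.8) {
            console.error(`Overall score ${scores.overall.toFixed(4)} is below 0.8`)
            process.exit(1)
          }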

        • samples: {
              id?: string;
              question: string;
              scores: Record<string, number>;
              reasoning?: Record<string, string>;
              tenantId?: string;
              metadata?: Record<string, unknown>;
          }[]

          Per-sample detailed results.
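
          Per-sample scores make it easy to surface the weakest questions. A sketch, assuming faithfulness was among the evaluated metrics (the 0.7 cutoff is illustrative):

          const { samples } = await evaluate({ ... })
          // List samples whose faithfulness fell below the cutoff:
          for (const s of samples.filter((x) => (x.scores.faithfulness ?? 1) < 0.7)) {
            console.warn(`Low faithfulness on ${s.id ?? '(no id)'}: ${s.question}`)
          }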

        • Optional stats?: Record<
              string,
              { mean: number; min: number; max: number; stddev: number; count: number },
          >

          Per-metric score distribution statistics (mean, min, max, stddev, count).

          Keys are metric names (same as the keys in scores, minus overall). Useful for understanding score variance and spotting metrics whose scores swing widely across samples. overall is excluded -- compute it from the individual metric stats.

          const { stats } = await evaluate({ ... })
          // High stddev indicates inconsistent pipeline behaviour:
          if ((stats?.faithfulness?.stddev ?? 0) > 0.15) {
            console.warn('Faithfulness varies widely across samples -- review your retrieval.')
          }
        • meta: {
              totalSamples: number;
              metrics: string[];
              provider: string;
              model: string;
              startedAt: string;
              completedAt: string;
              durationMs: number;
          }

          Metadata about the evaluation run (a usage sketch follows the field list below).

          • totalSamples: number

            Total number of samples evaluated.

          • metrics: string[]

            Names of the metrics that were evaluated.

          • provider: string

            LLM provider used (e.g. 'anthropic', 'openai').

          • model: string

            LLM model used (e.g. 'claude-opus-4-6').

          • startedAt: string

            ISO 8601 timestamp when evaluation started.

          • completedAt: string

            ISO 8601 timestamp when evaluation completed.

          • durationMs: number

            Wall-clock duration of the evaluation in milliseconds.
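
          meta is handy for quick run summaries. A sketch computing throughput from totalSamples and durationMs:

          const { meta } = await evaluate({ ... })
          const perSecond = (meta.totalSamples / (meta.durationMs / 1000)).toFixed(2)
          console.log(`${meta.provider}/${meta.model}: ${meta.totalSamples} samples in ${meta.durationMs} ms (${perSecond} samples/s)`)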

      Returns string

      CSV string with a header row. Returns an empty string if the result contains no samples.

      import { evaluate, toCsv } from 'rageval'
      import { writeFileSync } from 'node:fs'

      const result = await evaluate({ ... })
      writeFileSync('eval-scores.csv', toCsv(result))
      // Columns: id,question,faithfulness,answerRelevance,overall
      // Example row: q1,What is...,0.9500,0.8750,0.9125

      // With reasoning (for audit logs):
      const resultWithReasoning = await evaluate({ ..., includeReasoning: true })
      writeFileSync('eval-audit.csv', toCsv(resultWithReasoning))
      // Columns: id,question,faithfulness,answerRelevance,overall,faithfulness_reasoning,answerRelevance_reasoning