Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer

TL;DR

LLMs exhibit hallucinations and prompt sensitivity, motivating a robust, human-grounded notion of consistency. The authors collect a large-scale human baseline (n=2,976) of semantic-similarity judgments across 10 responses for 100 prompts and compare automated metrics to this ground truth. They find that existing automated, sampling- or logit-based metrics do not reliably align with human judgments, though an ensemble of 16 logit-derived scores can match the best-performing metrics while eliminating the need for resampling. The work advocates for incorporating human evaluation and real-world prompts in consistency assessment and demonstrates a cost-effective path (logit-based ensemble) to approximate human-aligned consistency estimates.

Abstract

Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses -- the model's confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users' perceptions of consistency of LLM responses. To find out, we performed a user study ($n=2,976$) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans' perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that our method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.
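The sampling-based notion of consistency mentioned above (how likely a similar response is to appear in a pool of resampled responses) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the token-overlap similarity, and the 0.5 threshold are all assumptions chosen to keep the example self-contained, and a real pipeline would substitute a semantic-similarity model or human judgments.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for a semantic-similarity judgment (assumption, not a metric from the paper)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def sampling_consistency(response: str, resampled: list[str], threshold: float = 0.5) -> float:
    """Fraction of resampled responses judged similar to `response` (hypothetical helper)."""
    if not resampled:
        raise ValueError("need at least one resampled response")
    similar = sum(jaccard_similarity(response, other) >= threshold for other in resampled)
    return similar / len(resampled)


# Hypothetical pool of responses resampled for the same prompt.
pool = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Lyon is the capital of France",
    "I am not sure",
]
score = sampling_consistency("Paris is the capital of France", pool)
print(score)
```

Note that the token-overlap stand-in counts the "Lyon" response as similar even though a reader would likely judge it a different answer; a gap of this kind between surrogate metrics and human perception is what the user study is designed to measure.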

Paper Structure

This paper contains 37 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of our study design
  • Figure 2: At the response-to-response-set level, our ensemble of 16 logit-based scores approximates human ratings as closely as USE.
  • Figure 3: At the per-prompt level, our logit-based ensemble method and USE have the highest Spearman correlation coefficients with human evaluations of model-prompt consistency among the metrics compared.
  • Figure 4: Instructions provided to participants for comparing pairs of sentences.
  • Figure 5: Running 10-fold cross-validation 100 times shows that an ensemble of all 16 logit-based scores yields the lowest mean squared error and the highest Spearman $\rho$ coefficient when compared to human ratings.
  • ...and 1 more figure