
Evaluation of Large Language Models via Coupled Token Generation

Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez

Abstract

State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite number of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Mistral and Qwen families. We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, we find that the win-rates derived from pairwise comparisons, made by a strong large language model, between responses to prompts from the LMSYS Chatbot Arena platform differ under coupled and vanilla autoregressive generation.
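
The idea of sampling with the same source of randomness can be illustrated with a minimal sketch. It assumes Gumbel-max sampling as the shared-noise sampler, which is one common choice and not necessarily the sampler used in the paper. It also replaces the two language models with two fixed toy logit vectors over a four-token vocabulary and scores a single token. The names `score_difference`, `logits_a`, `logits_b` and `correct_token` are hypothetical.

```python
import numpy as np


def score_difference(logits_a, logits_b, correct_token, n_samples, coupled, seed=0):
    """Estimate the mean and variance of score(A) - score(B) for one token.

    Assumption: each "model" is a fixed logit vector, and a token is drawn
    with the Gumbel-max trick, i.e. argmax(logits + Gumbel noise), which is
    a sample from softmax(logits). The score is 1 if the sampled token
    equals `correct_token` and 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    vocab_size = logits_a.shape[0]

    # Noise used by model A: one Gumbel vector per sample.
    noise_a = rng.gumbel(size=(n_samples, vocab_size))
    # Coupled generation reuses the same noise for model B;
    # independent (vanilla) generation draws fresh noise.
    noise_b = noise_a if coupled else rng.gumbel(size=(n_samples, vocab_size))

    tokens_a = np.argmax(logits_a + noise_a, axis=1)
    tokens_b = np.argmax(logits_b + noise_b, axis=1)

    diffs = (tokens_a == correct_token).astype(float) - (
        tokens_b == correct_token
    ).astype(float)
    return diffs.mean(), diffs.var()


# Hypothetical toy models: similar, but not identical, next-token logits.
logits_a = np.array([2.0, 1.0, 0.5, 0.0])
logits_b = np.array([1.5, 1.2, 0.3, 0.1])

mean_coupled, var_coupled = score_difference(
    logits_a, logits_b, correct_token=0, n_samples=20_000, coupled=True
)
mean_indep, var_indep = score_difference(
    logits_a, logits_b, correct_token=0, n_samples=20_000, coupled=False
)

print(f"coupled:     mean={mean_coupled:.3f}  var={var_coupled:.3f}")
print(f"independent: mean={mean_indep:.3f}  var={var_indep:.3f}")
```

In this toy setting both schemes estimate the same expected score difference, but the coupled estimate has lower variance. This follows from Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y): shared noise makes the two scores positively correlated, so fewer samples are needed for the same precision.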

Paper Structure

This paper contains 33 sections, 5 theorems, 57 equations, 35 figures, 9 tables.

Key Result

Proposition 1

For any $m, m' \in \mathcal{M}$, it holds that

Figures (35)

  • Figure 1: Example of coupled autoregressive generation for Llama 1B and Llama 8B. Boxes represent endogenous random variables and circles represent exogenous random variables. The value of each endogenous variable is given by a function of the values of its ancestors in the causal graph, as defined by Eq. \ref{eq:SCM}. The value of the coupled noise variable $U_1$ (purple) is sampled independently from a given distribution $P_U$, and it determines the stochastic state of the samplers used by both Llama 1B and Llama 8B during the generation of token $T_1$.
  • Figure 2: Comparison between 1B and 3B on questions from the knowledge area "college computer science" of the MMLU dataset. Panel (a) shows the kernel density estimate (KDE) of the covariance between the scores of the two LLMs on each question under coupled generation; the dashed line corresponds to the average value. Panel (b) shows the KDE of the variance of the difference between the scores of the LLMs on each question under coupled and independent generation; the highlighted point corresponds to the median value. Panel (c) shows the absolute error in the estimation of the expected difference between the scores of the LLMs against the number of samples; for each point on the x-axis, we perform $1{,}000$ sub-samplings and shaded areas correspond to $95\%$ confidence intervals.
  • Figure 3: Evaluation of six LLMs using pairwise comparisons on questions from the LMSYS-Chat-1M dataset. We estimate the empirical win-rate of each LLM against each other using pairwise comparisons between the outputs to $500$ questions with $10$ (different) random seeds under both coupled and independent generation. Panel (a) shows the empirical win-rate of $\texttt{bnb-8bit}$ against all other LLMs, where error bars correspond to $95\%$ confidence intervals. Here, for each pair of empirical win-rates under coupled and independent generation, we conduct a two-tailed z-test of the null hypothesis that the empirical win-rates are the same; (****, ***) indicate $p$-values ($<0.0001$, $<0.001$). We present qualitatively similar results for other LLMs in Appendix \ref{app:lmsys}. Panel (b) shows the average win-rate of each LLM across all other LLMs ($\pm 95\%$ confidence intervals). To derive the rankings, for each LLM, we choose the lowest ranking provided by the method of Chatzi et al. (2024).
  • Figure 4: Comparison between four pairs of LLMs in the Llama family on multiple-choice questions from the "college computer science" knowledge area of the MMLU dataset. Panels in column (a) show the kernel density estimate (KDE) of the covariance between the scores of the two LLMs on each question under coupled generation; the dashed lines correspond to average values. Panels in column (b) show the KDE of the variance of the difference between the scores of the LLMs on each question under coupled and independent generation; the highlighted points correspond to median values. Panels in column (c) show the absolute error in the estimation of the expected difference between the scores of the LLMs against the number of samples; for each point on the x-axis, we perform $1{,}000$ sub-samplings and shaded areas correspond to $95\%$ confidence intervals.
  • Figure 5: Comparison between four pairs of LLMs in the Llama family on multiple-choice questions from the "college computer science" knowledge area of the MMLU dataset. Panels in column (a) show the kernel density estimate (KDE) of the covariance between the scores of the two LLMs on each question under coupled generation; the dashed lines correspond to average values. Panels in column (b) show the KDE of the variance of the difference between the scores of the LLMs on each question under coupled and independent generation; the highlighted points correspond to median values. Panels in column (c) show the absolute error in the estimation of the expected difference between the scores of the LLMs against the number of samples; for each point on the x-axis, we perform $1{,}000$ sub-samplings and shaded areas correspond to $95\%$ confidence intervals.
  • ...and 30 more figures
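
The two-tailed z-test mentioned in the caption of Figure 3 can be sketched as follows. The caption does not state which variant of the test is used, so this sketch assumes a pooled two-proportion z-test that treats all pairwise comparisons as independent. The function name `two_proportion_z_test` and the win counts in the usage lines are hypothetical and are not results from the paper.

```python
import math


def two_proportion_z_test(wins_1, n_1, wins_2, n_2):
    """Two-tailed z-test of the null hypothesis that two win-rates are equal.

    Assumption: a pooled two-proportion z-test on independent comparisons.
    Returns the z statistic and the two-tailed p-value.
    """
    rate_1 = wins_1 / n_1
    rate_2 = wins_2 / n_2

    # Under the null hypothesis both samples share one win-rate,
    # estimated by pooling the two samples.
    pooled = (wins_1 + wins_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    if std_err == 0.0:
        raise ValueError("pooled win-rate is 0 or 1; the z statistic is undefined")

    z = (rate_1 - rate_2) / std_err
    # Two-tailed p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 500 questions x 10 seeds = 5,000 comparisons per setting.
z_stat, p_val = two_proportion_z_test(wins_1=2600, n_1=5000, wins_2=2400, n_2=5000)
print(f"z = {z_stat:.2f}, p = {p_val:.2e}")
```

A small p-value here would indicate that the win-rate of a model against a given opponent differs between the two generation schemes by more than sampling noise alone would explain.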

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Definition 1
  • Proposition 5