Unbiased Evaluation of Large Language Models from a Causal Perspective

Meilin Chen, Jian Tian, Liang Ma, Di Xie, Weijie Chen, Jiang Zhu

TL;DR

This work addresses benchmark contamination and bias in large language model (LLM) evaluation by introducing the Unbiased Evaluator, a causal evaluation framework that uses Bags Of Atomic InTerventions (BOAT) to dynamically perturb input configurations. The authors formalize evaluation bias, decompose it into original, related, and independent components, and show that Agents-as-an-Evaluator methods suffer from data and model biases. Through a minimal probing task and extensive experiments on ARC-C, MMLU, and GSM8K, the Unbiased Evaluator yields more robust, interpretable assessments and aligns more closely with expert judgments and LiveBench rankings. The approach reduces data and model biases, mitigates contamination effects, and scales across model sizes, offering a principled path toward fairer, more transparent LLM evaluation.

Abstract

Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.

Paper Structure

This paper contains 33 sections, 13 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) Agents-as-an-Evaluator suffers from data and model bias. (b) Our proposed Unbiased Evaluator dynamically evaluates LLMs with designed Bags Of Atomic InTerventions (BOAT).
  • Figure 2: Model bias visualizations. Left: $\mathcal{R}_{OC}$ vs. strength on the ARC-C and MMLU datasets. Right: $\mathcal{R}_{UC}$ vs. strength on the ARC-C and MMLU datasets. Strength refers to the probability defined in Equation \ref{eq_strengh}. A higher strength value indicates a greater proportion of "processed" samples within the dataset ("process" denotes rephrasing in Agents-as-an-Evaluator and BOAT in the Unbiased Evaluator). The point where strength=0 represents the original datasets. Mistral, Yi, Llama, and Qwen represent Mistral-Large-2411, Yi1.5-34B-Chat, Llama3.1-70B-Instruct, and Qwen2.5-72B-Instruct, respectively. For Agents-as-an-Evaluator, we observe a significant increase in $\mathcal{R}_{OC}$ with growing strength, while $\mathcal{R}_{UC}$ remains relatively stable, indicating the existence of model bias. In contrast, our Unbiased Evaluator remains relatively stable on both $\mathcal{R}_{OC}$ and $\mathcal{R}_{UC}$.
  • Figure 3: (a) Traditional evaluation methods rely on static, fixed variables and suffer from contamination issues. (b) The Unbiased Evaluator enhances the evaluation process by augmenting these variables through carefully designed Bags Of Atomic InTerventions (BOAT). (c) An example of BOAT. Underlined contents are derived from multiple interventions.
  • Figure 4: The confusion matrix of original benchmarks and Unbiased Evaluator on ARC-C.
  • Figure 5: The performance of Qwen1.5 series and Llama2 series models on ARC-C and MMLU, considering both original benchmarks and Unbiased Evaluator.
  • ...and 1 more figure
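
The Figure 2 caption describes strength as a probability that controls the proportion of "processed" samples in a dataset, with strength=0 giving the original dataset. The following is a minimal sketch of that idea only, not the authors' implementation: the function name `apply_with_strength`, the `process` callback, and the seeding scheme are all hypothetical, and the exact definition of strength is given in the paper's equation, which is not reproduced here.

```python
import random
from typing import Callable, List


def apply_with_strength(
    samples: List[str],
    process: Callable[[str], str],
    strength: float,
    seed: int = 0,
) -> List[str]:
    """Process each sample independently with probability `strength`.

    `process` stands in for whatever transformation is studied
    (rephrasing or an intervention); it is a placeholder here.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    # rng.random() is in [0, 1), so strength=0 never processes a sample
    # and strength=1 processes every sample.
    return [process(s) if rng.random() < strength else s for s in samples]
```

Under these assumptions, sweeping `strength` from 0 to 1 and re-scoring a model at each step would trace out curves of the kind the caption describes.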