UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization

Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

TL;DR

UniCBE tackles the inefficiency of current comparing-based evaluation (CBE) for large language models by introducing a unified, uniformity-driven framework that optimizes three core objectives simultaneously: accuracy, convergence, and scalability. It achieves this by constructing and integrating three decoupled sampling probability matrices—targeting uniformity across tuple combinations, win-rate uncertainty among model pairs, and model-wise allocation for new entrants—and by adopting a greedy tuple sampling strategy with Bradley-Terry preference aggregation. Empirical results on AlpacaEval show that UniCBE saves over 17% of evaluation budget while attaining a Pearson correlation with ground truth exceeding 0.995, and it scales to scenarios where new models are introduced with over 50% budget savings. The work demonstrates that balancing sampling bias, uncertainty descent, and updating uncertainty yields substantial practical gains for dynamic, iterative model evaluation in real-world AI systems.

Abstract

Human preference plays a significant role in measuring large language models and guiding them to align with human values. Unfortunately, current comparing-based evaluation (CBE) methods typically focus on a single optimization objective, failing to effectively utilize scarce yet valuable preference signals. To address this, we delve into key factors that can enhance the accuracy, convergence, and scalability of CBE: suppressing sampling bias, balancing the descending process of uncertainty, and mitigating updating uncertainty. Following the derived guidelines, we propose UniCBE, a unified uniformity-driven CBE framework that simultaneously optimizes these core objectives by constructing and integrating three decoupled sampling probability matrices, each designed to ensure uniformity in specific aspects. We further ablate the optimal tuple sampling and preference aggregation strategies to achieve efficient CBE. On the AlpacaEval benchmark, UniCBE saves over 17% of the evaluation budget while achieving a Pearson correlation with ground truth exceeding 0.995, demonstrating excellent accuracy and convergence. In scenarios where new models are continuously introduced, UniCBE can even save over 50% of evaluation costs, highlighting its improved scalability.
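
The TL;DR names Bradley-Terry preference aggregation as the step that turns pairwise preferences into model scores. As a rough illustration of that idea only, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization update. It is not the authors' implementation: the function names `bradley_terry` and `win_prob`, the iteration settings, and the toy win counts are all assumptions made for this example.

```python
def bradley_terry(wins, iters=500, tol=1e-12):
    """Fit Bradley-Terry strengths from a win-count matrix (illustrative sketch).

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths p normalized to sum to 1.
    """
    n = len(wins)
    if sum(sum(row) for row in wins) == 0:
        raise ValueError("need at least one recorded preference")
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = 0.0
            for j in range(n):
                n_ij = wins[i][j] + wins[j][i]
                # Skip self-pairs and pairs that were never compared.
                if j != i and n_ij > 0:
                    denom += n_ij / (p[i] + p[j])
            # A model with no comparisons keeps its previous strength.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_prob(p, i, j):
    """Predicted probability that model i is preferred over model j."""
    return p[i] / (p[i] + p[j])


# Hypothetical counts: each pair of three models compared 10 times.
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(toy_wins)
```

In this toy setup every pair is compared equally often, so the fitted strengths rank the models in the same order as their total wins. Under non-uniform sampling the two orderings can differ, which is one reason the choice of aggregation strategy matters.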

Paper Structure

This paper contains 46 sections, 33 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Flowchart of the process for comparing-based evaluation.
  • Figure 2: Analyses of potential sampling bias risks in CBE.
  • Figure 3: Results of compared CBE methods with GPT-4o as the judge on the AlpacaEval benchmark. The X-axis (applicable to all plots below) represents the preference budget ($k$). $\mathbf{\Delta}$ denotes the mean absolute error of the estimated win rate. $\mathbf{r_s}$ and $\mathbf{r_p}$ denote the Spearman and Pearson correlations, respectively, between the estimated model scores and the ground truth.
  • Figure 4: Results of compared CBE methods in the scenario where a new model is introduced every 2000 iterations.
  • Figure 5: Ablation studies of UniCBE with GPT-4o as the judge on AlpacaEval benchmark.
  • ...and 9 more figures
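
The Figure 3 caption defines three reported quantities: the mean absolute error $\mathbf{\Delta}$ of the estimated win rate, and the Spearman and Pearson correlations $\mathbf{r_s}$ and $\mathbf{r_p}$ between estimated model scores and ground truth. The sketch below shows one plain-Python way to compute them. The function names and the sample numbers are hypothetical, and the paper's exact evaluation code may differ (for example, in how ties are ranked).

```python
import math


def mean_abs_error(estimated, truth):
    """Mean absolute error between estimated and ground-truth win rates."""
    return sum(abs(e - t) for e, t in zip(estimated, truth)) / len(estimated)


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four models (not taken from the paper).
estimated_scores = [0.10, 0.40, 0.35, 0.80]
true_scores = [0.20, 0.50, 0.30, 0.90]
delta = mean_abs_error(estimated_scores, true_scores)
r_s = spearman(estimated_scores, true_scores)
r_p = pearson(estimated_scores, true_scores)
```

Spearman correlation depends only on the ordering of the models, while Pearson correlation also reflects how well the score gaps are reproduced, so the two can diverge when rankings are right but magnitudes are off.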