Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim

TL;DR

Arena-Lite addresses the challenge of reliably ranking Large Language Models (LLMs) with fewer evaluations by replacing baseline-mediated comparisons with direct head-to-head, per-prompt tournaments. By aggregating results across multiple randomized tournaments and fitting Bradley-Terry ratings to the match outcomes, Arena-Lite achieves rankings that align more closely with human ground-truth benchmarks than traditional baseline-based methods. The authors validate the approach through both a controlled stochastic modeling experiment and an empirical study using real LLM judges, demonstrating improved reliability even with smaller datasets or weaker judges. The work provides an open-source web demo and code, enabling researchers and industry practitioners to adopt efficient, reliable LLM evaluation in diverse research and deployment contexts.

Abstract

As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparison. Applying a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. The Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
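
The procedure described in the abstract (a randomized single-elimination tournament per prompt, followed by a Bradley-Terry fit over the pooled match outcomes) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names, the bye rule for odd-sized brackets, the simulated judge that picks the truly better model with probability `p_judge`, and the simple iterative Bradley-Terry fit are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def make_noisy_judge(true_rank, p_judge, rng):
    """Hypothetical stand-in for an LLM judge: returns the truly better
    model with probability p_judge, otherwise the worse one."""
    def judge(a, b):
        better, worse = (a, b) if true_rank[a] < true_rank[b] else (b, a)
        return better if rng.random() < p_judge else worse
    return judge


def run_tournament(models, judge, rng):
    """Run one single-elimination bracket (one prompt).
    Returns the list of (winner, loser) matches, always len(models) - 1 long."""
    bracket = list(models)
    rng.shuffle(bracket)  # randomized seeding for this prompt
    matches = []
    while len(bracket) > 1:
        next_round = []
        if len(bracket) % 2 == 1:
            # Assumption: the odd model out gets a bye into the next round.
            next_round.append(bracket.pop())
        for a, b in zip(bracket[0::2], bracket[1::2]):
            winner = judge(a, b)
            loser = b if winner == a else a
            matches.append((winner, loser))
            next_round.append(winner)
        bracket = next_round
    return matches


def bradley_terry(matches, models, iters=200):
    """Fit Bradley-Terry strengths with a basic iterative (MM-style) update."""
    wins = defaultdict(int)
    games = defaultdict(int)  # number of matches played per unordered pair
    for winner, loser in matches:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            # A model with no recorded matches keeps its current strength.
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


if __name__ == "__main__":
    rng = random.Random(0)
    models = [f"model_{k}" for k in range(8)]
    true_rank = {m: k for k, m in enumerate(models)}
    judge = make_noisy_judge(true_rank, p_judge=0.8, rng=rng)
    all_matches = []
    for _ in range(100):  # one randomized bracket per prompt
        all_matches.extend(run_tournament(models, judge, rng))
    strength = bradley_terry(all_matches, models)
    print(sorted(models, key=strength.get, reverse=True))
```

Because every match eliminates exactly one model, a single-elimination bracket over n models needs n - 1 judge calls per prompt and no baseline outputs. In a real setting the `judge` callable would wrap an LLM judge comparing two responses to the same prompt.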

Paper Structure

This paper contains 37 sections, 3 equations, 10 figures, 6 tables, 2 algorithms.

Figures (10)

  • Figure 1: Arena-Lite directly compares LLM response pairs over multiple single-elimination tournaments rather than comparing responses to baseline outputs. For deciding whether one LLM is better or worse than another, we suggest that direct head-to-head comparison is more intuitive and results in better separability.
  • Figure 2: Comparison of LLM ranking reliability between Arena-Lite and a baseline method in a stochastic simulation (Experiment 1). Ranking reliability is measured by the Spearman correlation ($\uparrow$) between the competition-derived ranking and the ground-truth ranking. Each box plot summarizes the results from 50 trials. The subplots analyze the effect of varying (from left to right) the number of competing models ($n_\text{models}$), the number of prompts ($n_\text{prompts}$), and the accuracy of the judge ($P_\text{judge}$). The single-elimination structure of Arena-Lite results in consistently higher correlation scores.
  • Figure 3: Ranking reliability of Arena-Lite vs. utilizing baseline outputs. Arena-Lite consistently demonstrates higher Spearman's rank correlation across numbers of benchmark prompts ($|X|$), indicating more reliable ranking. The evaluation was performed using gpt-4o (left) and gpt-4o-mini (right) as judge models, with a fixed number of models ($n_\text{models}$=19). Each box plot summarizes the results of 50 runs. (Experiment 2, Sec. 4.3).
  • Figure 4: Arena-Lite web screenshot 1: At the top of the result page, one can see the leaderboard of LLMs with their Bradley-Terry (BT) preference. If the benchmark dataset has subcategories, a radar chart (right) is also shown.
  • Figure 5: Arena-Lite web screenshot 2: Users can walk through the matches and tournaments one by one. Match brackets are visualized briefly with a text UI, and users can select any specific match to see the details (e.g. match result, prompt, and model outputs).
  • ...and 5 more figures
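
The reliability measure named in the captions of Figures 2 and 3, the Spearman correlation between a competition-derived ranking and the ground-truth ranking, can be computed as in the sketch below. It assumes both rankings are tie-free orderings of the same models (best first) and uses the standard closed form $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference in rank positions of model $i$. The function name and the example model names are hypothetical.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two tie-free rankings.

    Each ranking is a list of the same model names, ordered best first.
    """
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must be tie-free orderings of the same models")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two models")
    position_b = {model: idx for idx, model in enumerate(ranking_b)}
    # Sum of squared differences in rank position for each model.
    d_squared = sum((idx - position_b[model]) ** 2
                    for idx, model in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    truth = ["m1", "m2", "m3", "m4"]
    derived = ["m2", "m1", "m3", "m4"]  # one adjacent swap
    print(spearman_rho(derived, truth))
```

A value of 1 means the derived ranking matches the ground truth exactly, and a value of -1 means it is fully reversed.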