League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Xiaobing Sun, Tian Xia, Kai Chen, Xiaofeng Wang, Baosheng Wang
TL;DR
This work introduces League of LLMs (LOL), a benchmark-free framework in which multiple LLMs evaluate one another through dynamic question-generation rounds and decentralized scoring, addressing the data contamination, opacity, and subjective bias of traditional benchmarks. By replacing fixed datasets with self-generated tasks and using Borda-based scoring for mathematics alongside absolute scores for programming, LOL achieves stable internal rankings (Top-$k$ consistency $= 70.7\%$) and surfaces findings such as memorization-based answering and developer-family homophily bias. Eight LLMs are compared across mathematics and programming, yielding discriminative capability rankings that correlate with established benchmarks ($\rho$ ranging from approximately 0.71 to 0.93). The approach demonstrates that mutual evaluation with reference answers can produce reliable, professional, and transparent assessments, and the authors provide public code to extend evaluation to new models and domains.
Abstract
Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (Top-$k$ consistency $= 70.7\%$). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance, "memorization-based answering" behaviors are observed in some models, and a statistically significant homophily bias is found within the OpenAI family ($\Delta = 9$, $p < 0.05$). Finally, we make our framework and code publicly available as a valuable complement to the current LLM evaluation ecosystem.
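To make the Borda-based scoring mentioned in the TL;DR concrete, the following is a minimal sketch of a standard Borda count applied to rankings submitted by several evaluator models. It is not the authors' implementation. The function names, the model labels, the point scheme ($n - 1$ points for first place down to $0$ for last), and the alphabetical tie-break are all assumptions made for illustration, and the summary above does not say how LOL treats self-votes or ties.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Aggregate rankings with a standard Borda count.

    rankings: dict mapping an evaluator name to a list of candidate
    names ordered from best to worst (hypothetical input format).
    A candidate in position i of an n-item list earns n - 1 - i points.
    """
    scores = defaultdict(int)
    for ordered in rankings.values():
        n = len(ordered)
        for position, candidate in enumerate(ordered):
            scores[candidate] += n - 1 - position
    return dict(scores)


def league_ranking(rankings):
    """Return candidates sorted by total Borda score, highest first.

    Ties are broken alphabetically (an assumption for determinism).
    """
    scores = borda_scores(rankings)
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical three-model league; each model ranks all answers.
example = {
    "model_a": ["model_b", "model_c", "model_a"],
    "model_b": ["model_b", "model_a", "model_c"],
    "model_c": ["model_a", "model_b", "model_c"],
}

print(borda_scores(example))
print(league_ranking(example))
```

Because each evaluator contributes only an ordering, a Borda count of this kind is insensitive to how generously or harshly an individual model assigns raw scores, which is one plausible reason to prefer it over averaging absolute scores.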
