Reliable, Reproducible, and Really Fast Leaderboards with Evalica
Dmitry Ustalov
TL;DR
The paper presents Evalica, an open-source toolkit for building reliable, reproducible NLP leaderboards from pairwise preferences. It combines a Rust core with Python bindings, bootstrapped confidence intervals, and visualizations, and it is accessible through a Web interface, a command-line interface, and a Python API, enabling a fast, correct, multi-language evaluation workflow. Pairwise win rates are computed from model scores as $p_{ij} = \frac{s_i}{s_i+s_j}$, where $s_i$ and $s_j$ are the scores of models $i$ and $j$, and bootstrap resampling is used to derive 95% confidence intervals. Performance benchmarks show runtimes up to tens of times faster than Python implementations. The work aims to accelerate evaluation loops, reduce common errors in leaderboard construction, and encourage broader adoption through public packaging and governance, with future plans covering more use cases and languages.
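The win-rate formula and the bootstrap step above can be illustrated with a minimal NumPy sketch. This is not Evalica's API or implementation: the function names (`pairwise_win_rates`, `win_fraction_scores`, `bootstrap_ci`), the toy scoring rule (share of comparisons won, ties not modeled), and the fallback of 0.5 when both scores are zero are all assumptions made for illustration.

```python
import numpy as np


def pairwise_win_rates(scores):
    """Return the matrix p[i, j] = s_i / (s_i + s_j) for non-negative scores."""
    s = np.asarray(scores, dtype=float)
    total = s[:, None] + s[None, :]
    # Assumption: if both scores are zero, treat the pair as a coin flip.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, s[:, None] / total, 0.5)


def win_fraction_scores(left, right, left_wins, n_models):
    """Toy scoring rule (hypothetical): share of comparisons each model won."""
    left, right = np.asarray(left), np.asarray(right)
    left_wins = np.asarray(left_wins, dtype=float)
    wins = np.zeros(n_models)
    games = np.zeros(n_models)
    np.add.at(wins, left, left_wins)        # credit the left model when it wins
    np.add.at(wins, right, 1.0 - left_wins)  # otherwise credit the right model
    np.add.at(games, left, 1.0)
    np.add.at(games, right, 1.0)
    return np.divide(wins, games, out=np.zeros(n_models), where=games > 0)


def bootstrap_ci(left, right, left_wins, n_models, n_rounds=1000, seed=0):
    """Percentile bootstrap: resample comparisons with replacement, rescore."""
    rng = np.random.default_rng(seed)
    left, right = np.asarray(left), np.asarray(right)
    left_wins = np.asarray(left_wins, dtype=float)
    n = len(left)
    samples = np.empty((n_rounds, n_models))
    for r in range(n_rounds):
        idx = rng.integers(0, n, size=n)
        samples[r] = win_fraction_scores(
            left[idx], right[idx], left_wins[idx], n_models
        )
    # 95% interval: the 2.5th and 97.5th percentiles of the resampled scores.
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return lower, upper


# Five comparisons among three models (indices 0, 1, 2); 1 means the left model won.
left = [0, 0, 1, 0, 2]
right = [1, 2, 2, 1, 1]
left_wins = [1, 1, 0, 1, 0]

scores = win_fraction_scores(left, right, left_wins, n_models=3)
win_rates = pairwise_win_rates(scores)
lower, upper = bootstrap_ci(left, right, left_wins, n_models=3)
```

By construction, `win_rates[i, j] + win_rates[j, i]` equals 1 for every pair, and the diagonal is 0.5. Any other non-negative scoring method could be substituted for the toy rule without changing the bootstrap loop.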
Abstract
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
