Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Dmitry Ustalov

TL;DR

The paper presents Evalica, an open-source toolkit for building reliable and reproducible NLP leaderboards from pairwise preferences. It combines a Rust core with Python bindings, bootstrapped confidence intervals, and visualizations, and it is accessible through a Web interface, a command-line interface, and a Python API, enabling a fast, correct, multi-language evaluation workflow. Given the scores $s_i$ and $s_j$ of two models, the pairwise win rate is computed as $p_{ij} = \frac{s_i}{s_i+s_j}$, and bootstrap procedures are used to derive 95% confidence intervals. Performance benchmarks show runtimes up to tens of times faster than Python implementations. The work aims to accelerate evaluation loops, reduce common errors in leaderboard construction, and guide broader adoption through public packaging, governance, and future plans for more use cases and languages.
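
To make the win-rate formula and the bootstrap step concrete, here is a minimal NumPy sketch. It does not use Evalica's API. The function names (`win_rate_scores`, `pairwise_probabilities`, `bootstrap_ci`) are hypothetical, and the scorer is a simple share-of-wins stand-in chosen for illustration, not necessarily one of the scoring methods the toolkit implements.

```python
import numpy as np


def win_rate_scores(winners, losers, n_models):
    """Score each model by its share of won comparisons (illustrative stand-in scorer)."""
    wins = np.bincount(winners, minlength=n_models).astype(float)
    losses = np.bincount(losers, minlength=n_models).astype(float)
    total = wins + losses
    # Models that never appear in a comparison get a score of 0.
    return np.divide(wins, total, out=np.zeros(n_models), where=total > 0)


def pairwise_probabilities(scores):
    """p[i, j] = s_i / (s_i + s_j); pairs with s_i + s_j == 0 are left as NaN."""
    s = np.asarray(scores, dtype=float)
    denom = s[:, None] + s[None, :]
    return np.divide(s[:, None], denom, out=np.full(denom.shape, np.nan), where=denom > 0)


def bootstrap_ci(winners, losers, n_models, n_rounds=1000, seed=0):
    """95% percentile bootstrap interval for each model's score."""
    rng = np.random.default_rng(seed)
    winners = np.asarray(winners)
    losers = np.asarray(losers)
    n = len(winners)
    samples = np.empty((n_rounds, n_models))
    for r in range(n_rounds):
        # Resample comparisons with replacement and rescore.
        idx = rng.integers(0, n, size=n)
        samples[r] = win_rate_scores(winners[idx], losers[idx], n_models)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return lower, upper


# Toy data: each position is one comparison, given as (winner index, loser index).
winners = [0, 0, 1, 0, 1, 2]
losers = [1, 2, 2, 1, 0, 1]
scores = win_rate_scores(winners, losers, n_models=3)
probabilities = pairwise_probabilities(scores)
lower, upper = bootstrap_ci(winners, losers, n_models=3, n_rounds=200)
```

The percentile bootstrap is used here because it is the simplest variant to read. With so few comparisons the intervals are wide, which is the behavior a leaderboard should surface rather than hide.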

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
