Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Dmitry Ustalov

TL;DR

The paper presents Evalica, an open-source toolkit for building reliable and reproducible NLP leaderboards from pairwise preferences. It combines a Rust core with Python bindings, bootstrapped confidence intervals, and access through a Web interface, a command-line interface, and a Python API, enabling a fast, correct, multi-language evaluation workflow. Given model scores $s_i$ and $s_j$, the pairwise win rate is computed as $p_{ij} = \frac{s_i}{s_i+s_j}$, and bootstrap procedures are used to derive 95% confidence intervals. Performance benchmarks show runtimes up to tens of times faster than Python implementations. The work aims to accelerate evaluation loops, reduce common errors in leaderboard construction, and encourage broader adoption through public packaging, governance, and plans for more use cases and languages.
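To make the win-rate formula concrete, here is a minimal sketch in plain Python. It does not use Evalica's API; the helper name `pairwise_win_rates` and the scores are made up for illustration, and the scores are assumed to be strictly positive so the denominator never vanishes.

```python
def pairwise_win_rates(scores):
    """Return a matrix p with p[i][j] = s_i / (s_i + s_j).

    Hypothetical helper for illustration only; assumes positive scores.
    """
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be strictly positive")
    n = len(scores)
    return [
        [scores[i] / (scores[i] + scores[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up scores for three models.
scores = [3.0, 1.0, 1.0]
p = pairwise_win_rates(scores)

# The first model is expected to beat the second in 3 / (3 + 1) of comparisons.
print(p[0][1])
```

By construction $p_{ij} + p_{ji} = 1$, and two models with equal scores get a win rate of one half against each other.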

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.

Paper Structure

This paper contains 12 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Evalica facilitates the highlighted aspects of leaderboard-making that involve aggregation of judgements, scoring the models with bootstrapped confidence intervals (CIs), and getting the final model ranks.
  • Figure 2: Evalica has a core in Rust that is covered by a comprehensive suite of tests in Python. We simplify prototyping and increase test reliability by keeping an independent implementation of each method in Python.
  • Figure 3: Performance scaling analysis of the Rust implementations in Evalica on the synthetic version of the Chatbot Arena dataset. Both scales are logarithmic. Time is in seconds, dataset size is the number of pairs; a 95% confidence interval is shown for ten runs. Lower is better.
  • Figure 4: An example of computing Elo ranking and the corresponding pairwise win rates with Evalica. Other methods can be applied similarly with a trivial modification: bradley_terry, average_win_rate, etc. See https://github.com/dustalov/evalica/blob/master/Tutorial.ipynb for an executable example.
  • Figure 5: An example of bootstrapping a 95% confidence interval of Bradley and Terry (1952) scores with Evalica and pandas (McKinney, 2010). Any other supported model can be applied after a trivial modification. For simplicity, we do not show an example with scipy.stats.bootstrap (Virtanen et al., 2020), yet it is possible. See https://github.com/dustalov/evalica/blob/master/Chatbot-Arena.ipynb for an executable example.
  • ...and 2 more figures
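
The bootstrapping idea mentioned in the Figure 5 caption can be sketched without Evalica or pandas. The code below is an illustration under stated assumptions, not the paper's implementation: it swaps the Bradley and Terry model for a plain win fraction (ties count as half a win), uses a percentile bootstrap, and runs on made-up comparisons. The names `win_fraction`, `percentile`, and `bootstrap_ci` are hypothetical.

```python
import math
import random


def win_fraction(comparisons, model):
    """Share of the model's comparisons that it won; a tie counts as half."""
    total, played = 0.0, 0
    for left, right, winner in comparisons:
        if model not in (left, right):
            continue
        played += 1
        if winner == model:
            total += 1.0
        elif winner is None:  # None marks a tie in this toy encoding
            total += 0.5
    return total / played if played else float("nan")


def percentile(sorted_values, q):
    """Pick the value closest to quantile q (0..1) from a sorted list."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[index]


def bootstrap_ci(comparisons, model, rounds=1000, seed=0):
    """Percentile bootstrap: resample comparisons with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    estimates = []
    for _ in range(rounds):
        sample = rng.choices(comparisons, k=len(comparisons))
        estimate = win_fraction(sample, model)
        if not math.isnan(estimate):  # skip resamples without the model
            estimates.append(estimate)
    estimates.sort()
    return percentile(estimates, 0.025), percentile(estimates, 0.975)


# Made-up data: model "a" wins 60 times, loses 30 times, and ties 10 times.
comparisons = (
    [("a", "b", "a")] * 60
    + [("a", "b", "b")] * 30
    + [("a", "c", None)] * 10
)
point = win_fraction(comparisons, "a")
low, high = bootstrap_ci(comparisons, "a")
print(point, low, high)
```

Replacing `win_fraction` with any other scoring function leaves the resampling loop unchanged, which mirrors the caption's remark that other supported models can be applied after a trivial modification.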