Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela

TL;DR

Dynaboard introduces an evaluation-as-a-service platform, integrated with Dynabench, that enables cloud-based, multi-metric benchmarking with a customizable Dynascore. It emphasizes reproducibility, accessibility, and forward/backwards compatibility, and it incorporates prediction costs through a utility-based ranking framework rooted in the average marginal rate of substitution (AMRS). The backend provides standardized, non-cherry-picked evaluations on a shared data and hardware setup, and reports metrics for performance, throughput, memory, fairness, and robustness. The frontend offers a dynamic leaderboard where users can adjust metric weights to reflect their own utility, and the results illustrate how rankings shift when costs such as memory and speed are prioritized. Overall, the work advocates diversified, transparent benchmarks to drive greener, fairer, and more practically useful NLP evaluation as models scale and tasks diversify.

Abstract

We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which -- despite their importance to practitioners -- have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences, placing more/less weight on a particular axis of evaluation or dataset. As state-of-the-art NLP models push the limits of traditional benchmarks, Dynaboard offers a standardized solution for a more diverse and comprehensive evaluation of model quality.

Paper Structure

This paper contains 41 sections, 2 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Left: The marginal rate of substitution (MRS) is the negative of the slope of an indifference curve (shown in red) with fixed utility $U$, each of which shows the trade-off being made between metric $\texttt{M}$ and performance $\texttt{perf}$. Since model (a) and (b) cannot possibly lie on the same curve -- because the former is better in both respects -- we assume that when all else (including utility) is held constant, the increase in performance from (b) to (a) should come at the expense of the increase in $\texttt{M}$, giving us an estimate of where the higher indifference curve lies. Right: Taking the line-of-best-fit to estimate this trade-off would not work when some models are strictly better than others.
  • Figure 2: Screenshots of the Dynaboard rankings for Sentiment Analysis, under the default weights (above) and custom weights (below). In the default setting, half the weight is placed on accuracy, so DeBERTa, RoBERTa, and T5 rank highest. In the custom setting, the weight is split with throughput and memory, so the memory-heavy T5 is supplanted by FastText.
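
The reweighting behaviour described in the Figure 2 caption can be illustrated with a small sketch. This is not Dynaboard's implementation: the function names, the sign convention for cost metrics, and the three models with their numbers are all invented for illustration, and the rate-of-substitution estimate (averaging the absolute change in a metric per unit change in performance over models sorted by performance) is an assumption based on the Figure 1 caption. The exact normalization used in the paper may differ.

```python
from typing import Dict, List

# Toy, invented numbers (NOT results from the paper): three hypothetical models.
MODELS: Dict[str, Dict[str, float]] = {
    "large":  {"perf": 95.0, "throughput": 4.0,  "memory": 12.0},
    "medium": {"perf": 93.0, "throughput": 8.0,  "memory": 6.0},
    "small":  {"perf": 85.0, "throughput": 40.0, "memory": 1.0},
}

# Assumed sign convention: +1 if higher is better, -1 if the metric is a cost.
DIRECTION = {"perf": 1.0, "throughput": 1.0, "memory": -1.0}


def average_rate(models: Dict[str, Dict[str, float]], metric: str) -> float:
    """Average |delta metric / delta perf| over consecutive models sorted by perf."""
    ordered = sorted(models.values(), key=lambda m: m["perf"])
    rates = []
    for lower, higher in zip(ordered, ordered[1:]):
        d_perf = higher["perf"] - lower["perf"]
        if d_perf != 0:  # skip ties in performance to avoid division by zero
            rates.append(abs((higher[metric] - lower[metric]) / d_perf))
    return sum(rates) / len(rates)


def score(models: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of metrics after converting each one into performance units."""
    rates = {m: 1.0 if m == "perf" else average_rate(models, m) for m in weights}
    return {
        name: sum(weights[m] * DIRECTION[m] * vals[m] / rates[m] for m in weights)
        for name, vals in models.items()
    }


def rank(models: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest aggregate score."""
    scores = score(models, weights)
    return sorted(scores, key=scores.get, reverse=True)


perf_heavy = {"perf": 0.9, "throughput": 0.05, "memory": 0.05}
cost_heavy = {"perf": 0.2, "throughput": 0.4, "memory": 0.4}

print(rank(MODELS, perf_heavy))  # ranking when performance dominates
print(rank(MODELS, cost_heavy))  # ranking when throughput and memory dominate
```

With these toy numbers, the performance-heavy weights favour the largest model, while shifting weight toward throughput and memory moves the smallest model to the top. This mirrors, in miniature, the kind of rank change the Figure 2 caption reports.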