ExplainaBoard: An Explainable Leaderboard for NLP

Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaicheng Chang, Junqi Dai, Yixin Liu, Zihuiwen Ye, Zi-Yi Dou, Graham Neubig

TL;DR

ExplainaBoard addresses the one-dimensional nature of NLP leaderboards by introducing an interpretable, interactive, and reliable evaluation platform for NLP models. It augments standard leaderboards with diagnostic tools for single-system and pairwise analysis, data-bias inspection, error analysis, and system combination, plus confidence-interval and calibration assessments. The platform covers 12 NLP tasks, 50 datasets, and more than 400 systems, with a practical case study on CoNLL-2003 NER demonstrating ensemble gains. The work aims to shift leaderboard culture toward output-driven research and a deeper understanding of model behavior beyond holistic accuracy.

Abstract

With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which, in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g., what is the best-performing system bad at?), (ii) interpret relationships between multiple systems (e.g., where does system A outperform system B? What if we combine systems A, B, and C?), and (iii) examine prediction results closely (e.g., what are common errors made by multiple systems, or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks. ExplainaBoard is kept up to date and was recently upgraded to support (1) a multilingual multi-task benchmark, (2) meta-evaluation, and (3) a more complicated task, machine translation, which reviewers also suggested. We have not only released an online platform at http://explainaboard.nlpedia.ai/ but also made our evaluation tool available as an API under the MIT License on GitHub (https://github.com/neulab/explainaBoard) and PyPI (https://pypi.org/project/interpret-eval/), which allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected to motivate "output-driven" research in the future.

Paper Structure

This paper contains 31 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Illustration of the ExplainaBoard concept. Compared to vanilla leaderboards, ExplainaBoard allows users to perform interpretable (single-system analysis, pairwise analysis, data bias), interactive (system combination, fine-grained/common error analysis), and reliable (confidence interval, calibration) analysis on systems in which they are interested. "Comb." denotes "combination" and "Errs" represents "errors". "PER, LOC, ORG" refer to different labels.
  • Figure 2: An example of the actual ExplainaBoard interface for NER over three top-performing systems on the CoNLL-2003 dataset. Box A shows the single-system analysis results obtained when users select the top-1 system and click the "Single Analysis" button. Box B shows the pairwise analysis results when top-2 systems are chosen and "Pair Analysis" is clicked. Users can click any bin of the histogram, which results in a fine-grained error case table. Box C represents a table with common errors of these top-3 systems. Box D illustrates the combined result of the top-3 systems.
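
The four analyses named in the Figure 2 caption (single-system, pairwise, common errors, combination) can be illustrated with a small Python sketch. This is not the ExplainaBoard API or the authors' implementation: the function names, the toy labels, and the use of per-example label accuracy (rather than the span-level F1 normally reported for NER) are all assumptions made for illustration, and majority voting is only one possible way to combine systems.

```python
from collections import Counter, defaultdict

def bucket_accuracy(gold, pred, bucket_of):
    """Single-system analysis: accuracy within each bucket of examples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for i, (g, p) in enumerate(zip(gold, pred)):
        bucket = bucket_of(i)
        totals[bucket] += 1
        hits[bucket] += int(g == p)
    return {b: hits[b] / totals[b] for b in totals}

def pairwise_gap(gold, pred_a, pred_b, bucket_of):
    """Pairwise analysis: per-bucket accuracy of system A minus system B."""
    acc_a = bucket_accuracy(gold, pred_a, bucket_of)
    acc_b = bucket_accuracy(gold, pred_b, bucket_of)
    return {b: acc_a[b] - acc_b[b] for b in acc_a}

def common_errors(gold, systems):
    """Indices of examples that every system gets wrong."""
    return [i for i, g in enumerate(gold) if all(p[i] != g for p in systems)]

def majority_vote(systems):
    """System combination by majority vote over per-example labels.

    Ties go to the label that appears first among the listed systems.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*systems)]

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hypothetical toy data: gold labels and three system outputs.
gold  = ["PER", "LOC", "ORG", "PER", "LOC", "ORG"]
sys_a = ["PER", "LOC", "PER", "PER", "ORG", "ORG"]
sys_b = ["PER", "ORG", "ORG", "PER", "ORG", "ORG"]
sys_c = ["LOC", "LOC", "ORG", "PER", "ORG", "LOC"]

by_gold_label = lambda i: gold[i]  # bucket examples by their gold label

single = bucket_accuracy(gold, sys_a, by_gold_label)
gap = pairwise_gap(gold, sys_a, sys_b, by_gold_label)
shared = common_errors(gold, [sys_a, sys_b, sys_c])
combined = majority_vote([sys_a, sys_b, sys_c])

print("single-system:", single)
print("A minus B:", gap)
print("common errors at:", shared)
print("combined accuracy:", accuracy(gold, combined))
```

In this toy setup the three systems disagree on different examples, so the vote recovers most individual mistakes; the one example all three get wrong stays wrong after combination, which is exactly the kind of case a common-error table surfaces.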