Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation

Jonne Sälevä, Duygu Ataman, Constantine Lignos

TL;DR

The paper tackles uncertainty in multilingual and multitask NLP benchmarks by separating data-side and model-side sources of variance and introducing a resampling-based variance-decomposition framework. It formalizes a three-level hierarchical model with between-language variance and within-language variability, further decomposed into seed and bootstrap components, and demonstrates how to estimate and propagate these uncertainties to rankings and pairwise model differences. Through QA on XQuAD, MT on FLORES-200, and NER on OpenNER, it shows that between-language variation often dominates and that ignoring multiple sources of variability can inflate claims of improvement. The authors provide practical, low-overhead tools to perform uncertainty-aware evaluations, promoting more reliable, generalizable conclusions in multilingual/multitask NLP research.

Abstract

We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards such as model rankings and pairwise differences between models.
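As a rough illustration of the data-side resampling the abstract refers to, the sketch below runs a generic paired bootstrap over test examples to get an interval for the difference between two models. It is not taken from the paper: the function name `bootstrap_difference`, the per-example score arrays, and the 95% percentile interval are all assumptions made for illustration. It covers only data-related variation; model-related variation (for example, across random seeds) would need its own replications.

```python
import numpy as np

def bootstrap_difference(scores_a, scores_b, n_boot=1000, seed=0):
    """Paired bootstrap of the mean score difference between two models.

    scores_a, scores_b: hypothetical per-example scores on the same test set.
    Returns the bootstrap differences and a 95% percentile interval.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    # Each row holds the indices of one resampled test set (with replacement).
    idx = rng.integers(0, n, size=(n_boot, n))
    # The same indices are used for both models, so the comparison stays paired.
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs, (lo, hi)

# Synthetic per-example scores, made up for this example only.
rng = np.random.default_rng(42)
model_b = rng.uniform(0.0, 1.0, size=200)
model_a = np.clip(model_b + rng.normal(0.02, 0.1, size=200), 0.0, 1.0)

diffs, (lo, hi) = bootstrap_difference(model_a, model_b)
print(f"mean difference: {diffs.mean():.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

An interval that comfortably includes zero suggests the observed gap may not survive a replication on a different test sample, which is the kind of replication uncertainty the abstract describes for pairwise differences.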

Paper Structure

This paper contains 41 sections, 2 equations, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Left: Notional diagram of an $M \times L$ leaderboard consisting of $M$ models tested on $L$ languages and replicated $R$ times each, yielding individual observations $x_{lr}^{(m)}$. Middle: Aggregation over the $L$ languages into a single scalar per model, with an estimate of the between-language variance $\nu_{m}^2$. Right: Aggregation over replications to estimate within-language uncertainty $\eta_{ml}$.
  • Figure 2: Tree diagram showing the multilevel structure of the experimental data containing $R$ replications of $L$ languages for a single model. Between-language variance $\nu_m^2$ is computed across the averages $\mu^{(m)}_{l}$ whereas within-language variance $\eta_{ml}^2$ corresponds to the variance among the leaf nodes $x_{lr}^{(m)}$ of each subtree.
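
The multilevel structure described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `decompose_variance`, the $(L, R)$ array layout, the toy scores, and the use of the unbiased sample variance (`ddof=1`) are all choices made here, and the paper's estimators may differ.

```python
import numpy as np

def decompose_variance(x):
    """Split one model's scores into between- and within-language variance.

    x: array of shape (L, R) holding x_{lr}^{(m)} for a single model m,
       with one row per language and one column per replication.
    """
    x = np.asarray(x, dtype=float)
    # Per-language averages mu_l^{(m)}: mean over the R replications.
    lang_means = x.mean(axis=1)
    # Between-language variance nu_m^2: variance across the language averages.
    between_var = lang_means.var(ddof=1)
    # Within-language variance eta_{ml}^2: variance among the replications
    # (the leaf nodes) of each language's subtree.
    within_var = x.var(axis=1, ddof=1)
    return lang_means, between_var, within_var

# Toy scores for L = 3 languages and R = 3 replications (made up for illustration).
scores = np.array([
    [70.0, 72.0, 71.0],
    [55.0, 54.0, 56.0],
    [63.0, 65.0, 64.0],
])

lang_means, between_var, within_var = decompose_variance(scores)
print("language means:", lang_means)
print("between-language variance:", between_var)
print("within-language variances:", within_var)
```

In this toy input the language averages differ far more than the replications within any one language, so the between-language term is much larger than the within-language terms.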