Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä, Duygu Ataman, Constantine Lignos
TL;DR
The paper tackles uncertainty in multilingual and multitask NLP benchmarks by separating data-side and model-side sources of variance and introducing a resampling-based variance-decomposition framework. It formalizes a three-level hierarchical model with between-language variance and within-language variance, the latter further decomposed into seed and bootstrap components, and demonstrates how to estimate these uncertainties and propagate them to rankings and pairwise model differences. Through QA on XQuAD, MT on FLORES-200, and NER on OpenNER, it shows that between-language variation often dominates and that ignoring any of these sources of variability can overstate the confidence of claimed improvements. The authors provide practical, low-overhead tools for uncertainty-aware evaluation, promoting more reliable and generalizable conclusions in multilingual and multitask NLP research.
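The three-level decomposition described above can be illustrated with a minimal sketch. It assumes a balanced array of scores indexed by language, random seed, and bootstrap replicate, and applies the law of total variance with population variances. The function name `decompose_variance` and the array layout are hypothetical choices made for illustration and are not necessarily the estimator used in the paper.

```python
import numpy as np


def decompose_variance(scores):
    """Split score variance into language, seed, and bootstrap parts.

    Assumption (illustrative, not from the paper): `scores` has shape
    (n_languages, n_seeds, n_bootstrap) and the design is balanced.
    Population variances (ddof=0) are used so the three components
    add up exactly to the total variance of all scores.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 3:
        raise ValueError("expected shape (languages, seeds, bootstrap)")

    # Bootstrap component: variance across replicates, averaged over
    # every (language, seed) cell.
    bootstrap_var = scores.var(axis=2).mean()

    # Seed component: variance of per-seed means within each language,
    # averaged over languages.
    seed_means = scores.mean(axis=2)            # (languages, seeds)
    seed_var = seed_means.var(axis=1).mean()

    # Between-language component: variance of per-language means.
    language_means = seed_means.mean(axis=1)    # (languages,)
    language_var = language_means.var()

    return {
        "between_language": float(language_var),
        "seed": float(seed_var),
        "bootstrap": float(bootstrap_var),
        "total": float(scores.var()),
    }
```

Comparing the three components shows which source dominates. Reporting only the bootstrap term, for example, would leave out the seed and between-language contributions entirely.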
Abstract
We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards, such as model rankings and pairwise differences between models.
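As a rough illustration of how resampling can attach uncertainty to leaderboard quantities, the sketch below bootstraps test examples and records each model's rank distribution and a percentile interval for one pairwise difference. It covers only data-side variability. Model-side variability, such as random seeds, would need an additional resampling level. The function name `bootstrap_rank_and_diff`, its parameters, and the paired per-example score layout are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def bootstrap_rank_and_diff(per_example_scores, n_boot=1000, seed=0):
    """Bootstrap rank probabilities and a pairwise difference interval.

    Assumption (illustrative): `per_example_scores` has shape
    (n_models, n_examples), all models are scored on the same examples,
    and higher scores are better.
    """
    scores = np.asarray(per_example_scores, dtype=float)
    n_models, n_examples = scores.shape
    rng = np.random.default_rng(seed)

    # Resample example indices with replacement. The same indices are
    # used for every model so that comparisons stay paired.
    idx = rng.integers(0, n_examples, size=(n_boot, n_examples))
    boot_means = scores[:, idx].mean(axis=2)    # (n_models, n_boot)

    # Rank models within each replicate (1 = best). Ties are broken
    # arbitrarily by argsort.
    ranks = np.argsort(np.argsort(-boot_means, axis=0), axis=0) + 1

    # rank_probs[m, r] = share of replicates where model m has rank r + 1.
    rank_probs = np.stack(
        [(ranks == r).mean(axis=1) for r in range(1, n_models + 1)],
        axis=1,
    )

    # 95% percentile interval for (model 0 - model 1).
    diff = boot_means[0] - boot_means[1]
    low, high = np.percentile(diff, [2.5, 97.5])
    return rank_probs, (float(low), float(high))
```

An interval for the difference that includes zero, or a rank distribution spread over several positions, suggests the observed ordering may not hold up under replication.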
