Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

Nils Reimers, Iryna Gurevych

TL;DR

The paper demonstrates that single-score evaluations with significance tests can mislead when comparing learning approaches due to random variation, especially for non-deterministic models. It formalizes evaluation methods based on score distributions and shows that best-run or single-run comparisons can falsely suggest superiority. Through NLP sequence-tagging experiments, it quantifies development/test variance and advocates two distribution-focused definitions of superiority, supported by empirical results. The authors urge reporting score distributions and using multiple seeds in shared tasks to achieve robust, generalizable conclusions.

Abstract

Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing such an approach can bring considerable visibility. This raises the question of how reliable our evaluation methodologies are for comparing approaches. One common methodology for identifying the state of the art is to partition the data into a train, a development, and a test set. Researchers train and tune their approach on some part of the dataset and then select the model that worked best on the development set for a final evaluation on unseen test data. Test scores from different approaches are compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistically significant difference in this type of evaluation is not due to a superior learning approach but due to chance. For example, on the CoNLL 2003 NER dataset we observed type I errors (false positives) in up to 26% of the cases at a threshold of p < 0.05, i.e., we would falsely conclude that two identical approaches differ significantly. We prove that this evaluation setup is unsuitable for comparing learning approaches. We formalize alternative evaluation setups based on score distributions.
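The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental setup: it assumes a hypothetical `train_and_score` function in which each training run of the same approach draws a slightly different true accuracy (the seed effect), and it uses a pooled two-proportion z-test as a simple stand-in for whatever significance test is applied to the test scores. All parameter values (`n_test`, `mean_acc`, `seed_std`) are illustrative assumptions, not figures from the paper.

```python
import math
import random


def two_proportion_p_value(correct_a, correct_b, n):
    """Two-sided p-value of a pooled two-proportion z-test (both samples have size n)."""
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(correct_a - correct_b) / n / se
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def train_and_score(rng, n_test, mean_acc, seed_std):
    """Hypothetical stand-in for one training run of a non-deterministic approach.

    The random seed shifts the model's true accuracy (assumption: Gaussian
    spread seed_std around mean_acc); the return value is the number of
    correct predictions on a test set of n_test examples.
    """
    acc = min(max(rng.gauss(mean_acc, seed_std), 0.0), 1.0)
    return sum(rng.random() < acc for _ in range(n_test))


def false_positive_rate(trials, n_test, mean_acc, seed_std, alpha=0.05, seed=0):
    """Fraction of trials in which two runs of the SAME approach differ 'significantly'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = train_and_score(rng, n_test, mean_acc, seed_std)
        b = train_and_score(rng, n_test, mean_acc, seed_std)
        if two_proportion_p_value(a, b, n_test) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Without seed variance the test behaves roughly as advertised;
    # with seed variance, identical approaches are flagged far more often.
    for std in (0.0, 0.01):
        rate = false_positive_rate(trials=200, n_test=2000, mean_acc=0.9, seed_std=std)
        print(f"seed_std={std}: share of significant differences = {rate:.3f}")
```

The significance test only accounts for sampling noise in the test set. Once the seed also moves the score, a single-run comparison conflates "this model is better" with "this approach is better", which is why comparing score distributions over several seeds is the safer setup.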

Paper Structure

This paper contains 19 sections, 15 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Common evaluation methodology to compare two approaches for a specific task.
  • Figure 2: Single score comparison for non-deterministic learning approaches (Evaluation 1).
  • Figure 3: Illustration of model tuning and comparing the best models $A_*$ and $B_*$ (Evaluation 2).
  • Figure 4: Ratio of statistically significant differences between $A_*$ and $\tilde{A}_*$ for different $n$-values.