Show Your Work: Improved Reporting of Experimental Results

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, Noah A. Smith

TL;DR

The paper tackles the problem that test-set scores alone can mislead model comparisons in NLP when the compared models were developed under different computational budgets. It introduces a budget-aware evaluation framework that reports the expected validation performance of the best model as a function of the budget, along with a closed-form estimator for this expectation and its variance. Through case studies on SST, contextual representations, SciTail, and SQuAD, it shows that the preferred method changes with compute and that past results may be unreachable with stated budgets. The work provides practical recommendations and a checklist to improve reproducibility, enabling fairer and more informative comparisons across NLP research.

Abstract

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.

Paper Structure

This paper contains 28 sections, 8 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Current practice when comparing NLP models is to train multiple instantiations of each, choose the best model of each type based on validation performance, and compare their performance on test data (inner box). Under this setup, and assuming test-set results are similar to validation results, one would conclude from the results above (hyperparameter search for two models on the 5-way SST classification task) that the CNN outperforms Logistic Regression (LR). In our proposed evaluation framework, we instead encourage practitioners to consider the expected validation accuracy ($y$-axis; shading shows $\pm 1$ standard deviation) as a function of budget ($x$-axis). Each point on a curve is the expected value of the best validation accuracy obtained ($y$) after evaluating $x$ random hyperparameter values. Note that (1) the better-performing model depends on the computational budget: LR has higher expected performance for budgets up to 10 hyperparameter assignments, while the CNN is better for larger budgets; and (2) given a model and a desired accuracy (e.g., 0.395 for the CNN), we can estimate the expected budget required to reach it (16; dotted lines).
  • Figure 2: Expected maximum performance of a BCN classifier on SST. We compare three embedding approaches (GloVe embeddings, GloVe + frozen ELMo, and GloVe + fine-tuned ELMo). The $x$-axis is time, on a log scale. We omit the variance for visual clarity. For each of the three model families, we sampled 50 hyperparameter values, and plot the expected maximum performance with the $x$-axis values scaled by the average training duration. The plot shows that for each approach (GloVe, ELMo frozen, and ELMo fine-tuned), there exists a budget for which it is preferable.
  • Figure 3: Comparing reported accuracies (dashed lines) on SciTail to expected validation performance under varying levels of compute (solid lines). The estimated budget required for expected performance to match the reported result differs substantially across models, and the relative ordering varies with budget. We omit variance for visual clarity.
  • Figure 4: Comparing reported development exact-match score of BIDAF (dashed line) on SQuAD to expected performance of the best model with varying computational budgets (solid line). The shaded area represents the expected performance $\pm 1$ standard deviation, within the observed range of values. It takes about 18 days (55 hyperparameter trials) for the expected performance to match the reported results.
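
The curves described in these captions can be approximated from a list of observed validation scores. The following is a minimal sketch, not the authors' released code: it assumes the scores come from i.i.d. random hyperparameter draws and treats their empirical distribution as the true one, so that the best of $n$ draws equals the $i$-th smallest observed score with probability $(i/N)^n - ((i-1)/N)^n$. The function names `expected_max_curve` and `budget_to_reach` and the toy scores are hypothetical, chosen only for illustration.

```python
import numpy as np


def expected_max_curve(scores, max_budget=None):
    """Return (mean, std) of the best validation score for budgets 1..max_budget.

    Assumption: `scores` are i.i.d. draws from random hyperparameter search,
    and the empirical distribution of `scores` stands in for the true one.
    """
    v = np.sort(np.asarray(scores, dtype=float))
    num_scores = len(v)
    if num_scores == 0:
        raise ValueError("need at least one validation score")
    if max_budget is None:
        max_budget = num_scores

    budgets = np.arange(1, max_budget + 1)[:, None]    # shape (B, 1)
    cdf_upper = np.arange(1, num_scores + 1) / num_scores  # P(V <= v_i)
    cdf_lower = np.arange(0, num_scores) / num_scores      # P(V <  v_i)

    # P(max of n draws == v_i) for every budget n and every sorted score v_i.
    pmf = cdf_upper ** budgets - cdf_lower ** budgets  # shape (B, N)

    mean = pmf @ v
    second_moment = pmf @ (v ** 2)
    # Clip tiny negative values caused by floating-point rounding.
    std = np.sqrt(np.clip(second_moment - mean ** 2, 0.0, None))
    return mean, std


def budget_to_reach(scores, target, max_budget=None):
    """Smallest budget whose expected best score reaches `target`, else None."""
    mean, _ = expected_max_curve(scores, max_budget)
    reached = np.nonzero(mean >= target)[0]
    return int(reached[0]) + 1 if reached.size else None


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.3, 0.4]  # made-up values, not from the paper
    mean, std = expected_max_curve(toy_scores)
    for n, (m, s) in enumerate(zip(mean, std), start=1):
        print(f"budget={n}: expected best={m:.4f} +/- {s:.4f}")
    print("budget to reach 0.30:", budget_to_reach(toy_scores, 0.30))
```

To obtain a time axis like the one in Figure 2, the budget values can be multiplied by the average training duration of one trial. Note that the estimate cannot exceed the largest observed score, so `budget_to_reach` returns `None` for targets above it.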