Show Your Work with Confidence: Confidence Bands for Tuning Curves

Nicholas Lourie, Kyunghyun Cho, He He

TL;DR

The paper addresses how to compare NLP models fairly when they may have received different amounts of hyperparameter tuning, and introduces exact, simultaneous, distribution-free confidence bands for tuning curves. The method first bounds the CDF of the score from a single round of random search, then translates those bounds into bounds on the tuning curve at every search budget, which supports reliable comparisons of mean or median tuning curves. The authors verify that the bands achieve their nominal coverage exactly, show that bootstrap bands do not, and run ablations indicating that median tuning curves with LD bands give robust, interpretable comparisons. They release opda, a library for computing these bands, to support more reproducible and cost-aware model evaluation in NLP and related fields.

Abstract

The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release opda: an easy-to-use library that you can install with pip. https://github.com/nicholaslourie/opda
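
As a rough illustration of what a tuning curve point estimate is, the sketch below computes a median tuning curve from the empirical CDF of random search scores. It relies on the fact that the best of $k$ independent draws from a distribution with CDF $F$ has CDF $F(y)^k$, so its median is the $0.5^{1/k}$ quantile of $F$. The function name `median_tuning_curve` is hypothetical and is not part of opda; this is a plain-Python sketch, not the authors' implementation.

```python
import math


def median_tuning_curve(scores, ks):
    """Point estimate of the median tuning curve (hypothetical helper).

    scores: validation scores, one per round of random search.
    ks: search budgets (numbers of rounds) at which to evaluate the curve.
    """
    ys = sorted(scores)
    n = len(ys)
    curve = []
    for k in ks:
        # The best of k draws has CDF F(y)**k, so its median is the
        # q-quantile of F with q = 0.5**(1/k).
        q = 0.5 ** (1.0 / k)
        # Smallest order statistic whose empirical CDF value i/n is >= q.
        i = max(1, math.ceil(q * n))
        curve.append(ys[i - 1])
    return curve


scores = [0.1, 0.2, 0.3, 0.4]
print(median_tuning_curve(scores, ks=[1, 2, 4]))
```

With only a handful of scores the estimate quickly saturates at the best observed score, which is one reason a point estimate alone can mislead.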

Paper Structure (36 sections, 2 theorems, 28 equations, 20 figures)


Key Result

Proposition 1

If $\forall y, \widehat{F}^l(y) \leq F(y) \leq \widehat{F}^u(y)$ with probability $1-\alpha$, then with probability $1-\alpha$, $\forall k, \hat{\tau}_m^l(k) \leq \tau_m(k) \leq \hat{\tau}_m^u(k)$.
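
The proposition can be sketched in code as follows. For the median, an upper bound on the CDF yields a lower bound on the tuning curve and vice versa, because a larger CDF places its $0.5^{1/k}$ quantile at a smaller score. The sketch below uses the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to build the simultaneous CDF band; this is a swapped-in, conservative band chosen for brevity, not the exact band the paper constructs. The function names and the `lower_bound`/`upper_bound` parameters (known limits on the score, such as 0 and 1 for accuracy) are assumptions made for this illustration and are not opda's API.

```python
import math


def dkw_epsilon(n, alpha):
    """Half-width of a 1 - alpha simultaneous DKW band for n samples."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def step_quantile(ys, cdf_values, q, fallback):
    """Smallest y in ys whose CDF value is >= q, else the fallback."""
    for y, c in zip(ys, cdf_values):
        if c >= q:
            return y
    return fallback


def median_tuning_curve_band(scores, ks, alpha, lower_bound, upper_bound):
    """Return (lower, upper) bounds on the median tuning curve at each k.

    Hypothetical helper: the CDF band comes from DKW, so coverage is
    at least 1 - alpha rather than exactly 1 - alpha.
    """
    ys = sorted(scores)
    n = len(ys)
    eps = dkw_epsilon(n, alpha)
    ecdf = [(i + 1) / n for i in range(n)]
    cdf_lo = [max(c - eps, 0.0) for c in ecdf]  # lower CDF band
    cdf_hi = [min(c + eps, 1.0) for c in ecdf]  # upper CDF band
    band = []
    for k in ks:
        q = 0.5 ** (1.0 / k)  # median of the best of k draws
        if q <= eps:
            # The upper CDF band already reaches q at the smallest score.
            lo = lower_bound
        else:
            # Upper CDF band gives the lower tuning curve bound.
            lo = step_quantile(ys, cdf_hi, q, ys[-1])
        # Lower CDF band gives the upper tuning curve bound; if it never
        # reaches q, only the known maximum score bounds the curve.
        hi = step_quantile(ys, cdf_lo, q, upper_bound)
        band.append((lo, hi))
    return band


scores = [i / 10 for i in range(1, 11)]
for k, (lo, hi) in zip([1, 2], median_tuning_curve_band(
        scores, ks=[1, 2], alpha=0.2, lower_bound=0.0, upper_bound=1.0)):
    print(k, lo, hi)
```

Because the CDF band holds for all $y$ at once, the resulting tuning curve band holds for all $k$ at once, which is what makes it simultaneous. Note how the upper bound falls back to `upper_bound` once $0.5^{1/k}$ exceeds what the lower CDF band can reach: with few search iterations, the data cannot bound the curve from above at large budgets.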

Figures (20)

  • Figure 1: The charts compare tuning curves for MLP and LSTM text classifiers on Reuters apte-etal-1994-automated, based on data from tang-etal-2020-showing. The tuning curve plots the F1 score of the best model after each round of random search. The left compares point estimates for the mean based on U and V-statistics; the right compares 50% confidence bands for the median tuning curve. The top and the bottom run the same analysis on different samples of 25 search iterations. The point estimates give contradictory conclusions without warning on different samples, disagreeing whether the LSTM ever beats the MLP. The differences between estimators are small in comparison to this sample variation. Confidence bands, in contrast, directly show the variation due to sampling.
  • Figure 2: The median tuning curve for DeBERTaV3 on MultiNLI (matched), based on 48 search iterations. The point estimate plots the empirical CDF's tuning curve.
  • Figure 3: Median tuning curves on MultiNLI (matched), with 80% confidence based on 48 search iterations. Point estimates plot the empirical CDFs' tuning curves.
  • Figure 4: Median tuning curves on MultiNLI (matched), with 80% confidence based on 48 search iterations. To assess hyperparameter importance, the curves compare tuning epochs (1-4) against leaving it at the default (3).
  • Figure 5: The pointwise coverage of 95% bootstrap confidence bands constructed from 50 search iterations. The graphs show the coverage at each point of the tuning curve, measured in simulation. The shaded regions are 95% Clopper-Pearson confidence intervals.
  • ...and 15 more figures

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof