When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

Wenzhang Du

TL;DR

The paper tackles the instability of small ML gains by proposing a paired evaluation framework that uses per-seed deltas, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test to decide significance under tight compute budgets. It formalizes a paired multi-seed design and a strict decision rule, and demonstrates that common practices like single runs or unpaired $t$-tests frequently overstate evidence for small improvements across CIFAR-10, CIFAR-10N, and AG News. Across the synthetic S0/S1/S2 scenarios (no improvement, small gain, and medium gain), the protocol reliably avoids false positives and remains conservative with $k=3$ seeds, providing a guardrail against over-claiming while offering practical guidance for reporting uncertainty. The work argues for adopting cautious, statistically grounded evaluation as a standard when reporting small gains, especially when compute limits the number of seeds and the improvements themselves are small.

Abstract

Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.
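
To make the sign-flip permutation test on per-seed deltas concrete, here is a minimal sketch in standard-library Python. It assumes an exact two-sided test on the mean delta that enumerates every sign pattern. The function name `sign_flip_pvalue` and the delta values are hypothetical illustrations, not the paper's code or results.

```python
from itertools import product
from statistics import mean


def sign_flip_pvalue(deltas):
    """Exact two-sided sign-flip p-value for the mean of paired per-seed deltas."""
    observed = abs(mean(deltas))
    patterns = list(product((1, -1), repeat=len(deltas)))
    # Count sign patterns whose mean is at least as extreme as the observed one.
    # The small tolerance guards against floating-point ties.
    extreme = sum(
        abs(mean([s * d for s, d in zip(signs, deltas)])) >= observed - 1e-12
        for signs in patterns
    )
    return extreme / len(patterns)


# Hypothetical per-seed accuracy deltas (new minus old), in percentage points.
deltas_k3 = [0.8, 1.1, 0.9]
deltas_k6 = [0.8, 1.1, 0.9, 1.0, 0.7, 1.2]

print(sign_flip_pvalue(deltas_k3))  # 0.25: only 2 of 8 patterns are as extreme
print(sign_flip_pvalue(deltas_k6))  # 0.03125: 2 of 64 patterns
```

Under this assumption, three seeds give only $2^3 = 8$ sign patterns, so the two-sided p-value cannot fall below $2/8 = 0.25$ however large the gain. That is consistent with the observation that the paired protocol never declares significance with three seeds.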

Paper Structure

This paper contains 30 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Effect sizes and BCa 95% confidence intervals under the paired protocol. Bars show $\Delta_{\text{paired}}$; whiskers show BCa intervals; the dashed line marks zero.
  • Figure 2: Comparison of p-values across scenarios: unpaired $t$-test (x-axis) vs. paired permutation (y-axis). Colours denote datasets; markers denote S1/S2. Dashed lines mark the 0.05 threshold.
  • Figure 3: CIFAR-10 S1 learning curves: test accuracy vs. epoch for three seeds, baseline (old) vs. variant (new).
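
The BCa intervals drawn as whiskers in Figure 1 can be sketched as follows. This is a rough standard-library illustration of the textbook BCa construction for the mean of per-seed deltas, not the paper's implementation. The function name `bca_interval`, the resample count, the tie handling, and the example deltas are all assumptions.

```python
import random
from statistics import NormalDist, mean


def bca_interval(deltas, n_boot=10000, alpha=0.05, seed=0):
    """Approximate BCa bootstrap interval for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    theta_hat = mean(deltas)
    # statistics.mean is computed exactly before rounding, so resamples with
    # the same multiset of values compare equal regardless of draw order.
    boots = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))

    nd = NormalDist()
    # Bias correction: share of bootstrap means strictly below the estimate,
    # clamped away from 0 and 1 so the inverse CDF stays finite.
    below = sum(b < theta_hat for b in boots) / n_boot
    below = min(max(below, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(below)

    # Acceleration from leave-one-out (jackknife) means.
    jack = [mean(deltas[:i] + deltas[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0  # degenerate case: all deltas equal

    def adjusted(q):
        # Map a nominal quantile level to its bias- and skew-adjusted level.
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        # Read the adjusted quantile off the sorted bootstrap means.
        return boots[min(n_boot - 1, max(0, int(adjusted(q) * n_boot)))]

    return pick(alpha / 2), pick(1 - alpha / 2)


# Hypothetical per-seed deltas (new minus old), in percentage points.
low, high = bca_interval([0.8, 1.1, 0.9])
print(low, high)
```

With only three deltas the bootstrap distribution has very few distinct values, so an interval like this is coarse and best read alongside the permutation test rather than on its own.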