Post-Hoc Reversal: Are We Selecting Models Prematurely?

Rishabh Ranjan; Saurabh Garg; Mrigank Raman; Carlos Guestrin; Zachary Lipton

TL;DR

This paper demonstrates post-hoc reversal, a phenomenon in which performance trends reverse after post-hoc transforms are applied, and proposes post-hoc selection, a simple technique in which post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices.

Abstract

Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also prevent the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples, exploiting differences in their learning dynamics from those of clean examples. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices. Our experiments span real-world vision, language, tabular and graph datasets. On an LLM instruction tuning dataset, post-hoc selection results in >1.5x MMLU improvement compared to naive selection.
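As a rough illustration of the difference between naive and post-hoc selection, the sketch below picks a checkpoint epoch in two ways: by the mean validation loss of the individual runs (base metric), and by the validation loss of their ensemble (post-hoc metric). It is a minimal sketch, not the authors' implementation. It assumes probability averaging as the ensembling transform and negative log-likelihood as the metric; the function names and the toy probabilities are hypothetical and chosen only so that a reversal shows up.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def base_metric(probs_runs, labels):
    """Base metric: average the loss of each individual run."""
    return float(np.mean([nll(p, labels) for p in probs_runs]))

def posthoc_metric(probs_runs, labels):
    """Post-hoc metric: loss of the ensemble (mean of run probabilities)."""
    return nll(np.mean(probs_runs, axis=0), labels)

def select_epochs(val_probs, labels):
    """val_probs has shape (epochs, runs, examples, classes)."""
    base = [base_metric(p, labels) for p in val_probs]
    post = [posthoc_metric(p, labels) for p in val_probs]
    return int(np.argmin(base)), int(np.argmin(post)), base, post

# Toy validation set (made-up numbers): two examples, both of class 0.
labels = np.array([0, 0])
val_probs = np.array([
    # Epoch 0: both runs agree and are moderately confident.
    [[[0.70, 0.30], [0.70, 0.30]],
     [[0.70, 0.30], [0.70, 0.30]]],
    # Epoch 1: each run is very confident on one example and
    # slightly wrong on the other, but the runs err on different examples.
    [[[0.99, 0.01], [0.45, 0.55]],
     [[0.45, 0.55], [0.99, 0.01]]],
])

naive_epoch, posthoc_epoch, base, post = select_epochs(val_probs, labels)
print("base loss per epoch:    ", np.round(base, 3))
print("post-hoc loss per epoch:", np.round(post, 3))
print("naive selection picks epoch", naive_epoch)
print("post-hoc selection picks epoch", posthoc_epoch)
```

In this toy setup the base loss is worse at epoch 1 than at epoch 0, while the ensemble loss is better, so the two selection rules disagree. The same pattern extends to other transforms (for example SWA or temperature scaling) by swapping out `posthoc_metric`.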

Paper Structure (28 sections, 5 equations, 16 figures, 11 tables)


Introduction
Related Work
Preliminaries and Background
Learning on Noisy Data
Post-Hoc Transforms in Machine Learning
Post-Hoc Reversal: Formalization and Empirical Study
Definitions
Experiments
Epoch-Wise Post-Hoc Reversal
Model-Wise Post-Hoc Reversal
Hyperparameter-Wise Post-Hoc Reversal
Intuitions for Post-Hoc Reversal
Post-Hoc Selection: Leveraging Post-Hoc Reversal in Practice
Experiments Across Domains and Modalities
LLM Instruction Tuning
...and 13 more sections

Figures (16)

Figure 1: An illustration of the phenomenon of post-hoc reversal on the FMoW dataset: base performance at epoch $t_2$ is worse than at epoch $t_1$ ($b_2 > b_1$), but post-hoc performance is better ($p_2 < p_1$). The current practice of naive selection considers base metrics to pick models at epoch $t_1$. Our proposed technique of post-hoc selection instead uses post-hoc metrics to pick models at epoch $t_2$, resulting in $> 2\times$ improvement over naive selection in both test loss and error. SWA+Ens+TS refers to the post-hoc transform obtained by composing SWA, ensemble (Ens) and temperature scaling (TS). Base curves show mean of $8$ runs, models from which constitute the ensembles. Individual runs are shown in lighter colors. See Fig. 5 for more detailed curves on this dataset.
Figure 2: A comparison of naive and post-hoc selection on label sets from CIFAR-10/100-N (abbr. C-10/100-N) for the SWA+TS transform. On noisy label sets, post-hoc selection is often $> 2\times$ better.
Figure 3: Loss and error for CIFAR-10-N Clean (approx. $0\%$ noise), Rand1 (approx. $17\%$ noise) and Worst (approx. $40\%$ noise). Except for ensemble curves, mean of $8$ runs is shown; individual runs are in lighter shades. Ensembles comprise models from these $8$ runs. For example, observe post-hoc reversal for C-10-N Worst: (1) error plot: from epoch $5$ to $50$, solid red (base) curve worsens but solid orange (SWA) curve improves; (2) error plot: solid red (base) curve has a double descent but dashed red (ensemble) curve does not; (3) loss plots: solid red (base) curve has a double descent pre-TS but not post-TS; (4) error plot: best error is at approx. epoch $5$ for solid red (base) curve but at approx. epoch $60$ for dashed orange (SWA ensemble) curve.
Figure 4: C-10-N Worst test curves against model size. Best width for solid blue curves is $\sim 10$ but for dashed orange curves, it is $\sim 50$ for error and $\sim 25$ for post-TS loss.
Figure 5: FMoW test curves for $3$ LR schedules. Note that the pre-TS loss is significantly higher than the post-TS loss. For example, observe post-hoc reversal w.r.t. cosine and constant LRs at epoch $50$ between: (1) solid blue (base) and dashed blue (ensemble) error curves; (2) solid blue (base) and solid orange (SWA) post-TS loss curves; (3) solid blue (base) curves for pre-TS and post-TS loss.
...and 11 more figures

Theorems & Definitions (5)

Definition 1: Post-Hoc Transform
Definition 2: Post-hoc reversal
Definition 3: Index-wise post-hoc reversal
Definition 4: Base and post-hoc curves
Definition 5: Post-hoc reversal for curves
