Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing

Adel Javanmard, Rudrajit Das, Alessandro Epasto, Vahab Mirrokni

TL;DR

This work develops a principled AMP-based framework for optimally combining a model's predictions with noisy labels in binary classification under Gaussian mixture and GLM ground truths. It derives Bayes-optimal aggregators g_t for iterative retraining, provides exact state-evolution characterizations, and reveals regimes where retraining helps or hurts depending on initialization. It also offers a practical variant for linear probing with cross-entropy that outperforms baselines in high-noise settings and validates the theory with experiments. Together, these results advance the understanding of self-boost via retraining and offer actionable guidance for robust training under label noise.

Abstract

Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. While prior works have demonstrated the benefits of specific heuristic retraining schemes, the question of how to optimally combine the model's predictions and the provided labels remains largely open. This paper addresses this fundamental question for binary classification tasks. We develop a principled framework based on approximate message passing (AMP) to analyze iterative retraining procedures for two ground-truth settings: the Gaussian mixture model (GMM) and the generalized linear model (GLM). Our main contribution is the derivation of the Bayes-optimal aggregator function for combining the current model's predictions and the given labels, which, when used to retrain the same model, minimizes its prediction error. We also quantify the performance of this optimal retraining strategy over multiple rounds. We complement our theoretical results by proposing a practically usable version of the theoretically optimal aggregator function for linear probing with the cross-entropy loss, and demonstrate its superiority over baseline methods in the high label noise regime.
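To make the idea of combining a model's prediction with a noisy label concrete, the following is a minimal sketch of a posterior-based combination rule in a toy one-dimensional setting. It is not the paper's aggregator $g_t$, which is not reproduced in this summary. The toy model is an assumption: the true label $y \in \{+1,-1\}$ has prior `prior_pos`, the model score is Gaussian with mean `y * m` and standard deviation `sigma`, and the observed label is flipped independently with probability `flip_prob`. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def posterior_positive(score, noisy_label, m, sigma, flip_prob, prior_pos):
    """P(y = +1 | score, noisy_label) under the assumed toy model.

    Assumptions (not taken from the paper): score | y ~ N(y * m, sigma^2),
    noisy_label equals y with probability 1 - flip_prob, and the score and
    the noisy label are independent given y.
    """
    # Log-likelihood ratio of the Gaussian score: log N(s; m) - log N(s; -m).
    llr_score = 2.0 * m * score / sigma**2
    # Log-likelihood ratio contributed by the (possibly flipped) label.
    llr_label = noisy_label * np.log((1.0 - flip_prob) / flip_prob)
    # Log prior odds of the positive class.
    llr_prior = np.log(prior_pos / (1.0 - prior_pos))
    return 1.0 / (1.0 + np.exp(-(llr_score + llr_label + llr_prior)))


def simulate(n=100_000, m=1.0, sigma=1.0, flip_prob=0.3, prior_pos=0.3, seed=0):
    """Compare three label sources on synthetic data from the toy model."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < prior_pos, 1, -1)          # true labels
    score = y * m + sigma * rng.standard_normal(n)          # model scores
    noisy = np.where(rng.random(n) < flip_prob, -y, y)      # flipped labels
    post = posterior_positive(score, noisy, m, sigma, flip_prob, prior_pos)
    combined = np.where(post >= 0.5, 1, -1)                 # posterior-mode label
    return {
        "noisy_label_error": float(np.mean(noisy != y)),
        "score_sign_error": float(np.mean(np.sign(score) != y)),
        "combined_error": float(np.mean(combined != y)),
    }


if __name__ == "__main__":
    for name, err in simulate().items():
        print(f"{name}: {err:.4f}")
```

In this toy model the posterior-mode label is the minimum-error rule among all functions of the score and the noisy label, so its error should not exceed that of either source used alone. The paper's setting is high-dimensional and iterative, so this sketch only illustrates the combination principle, not the retraining dynamics.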

Paper Structure

This paper contains 21 sections, 8 theorems, 107 equations, 6 figures, 5 tables.

Key Result

Theorem 3.1

Let $(\boldsymbol{\theta}^t,\bm{y}^t)_{t\ge0}$ be the AMP iterates given by eq:AMP-bteta-eq:AMP-y. Also let $(m_t,\sigma_t)_{t\ge 0}$ be the state evolution recursions given by eq:SE_GMM. Then under Assumption ass:GMM, for any pseudo-Lipschitz function $\psi:\mathbb{R}^2\to\mathbb{R}$ the following where $G \sim {\sf N}(0,1)$, $M \sim \nu_M$ (see second bullet point of Assumption ass:GMM) are ind
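For reference, the pseudo-Lipschitz condition on $\psi$ is commonly stated in the AMP literature as follows. The excerpt above does not give the order, so order 2 is assumed here as the usual convention: there exists a constant $L > 0$ such that

$$|\psi(x) - \psi(y)| \le L\,\bigl(1 + \|x\|_2 + \|y\|_2\bigr)\,\|x - y\|_2 \quad \text{for all } x, y \in \mathbb{R}^2.$$

Every Lipschitz function satisfies this condition, and so do functions with quadratic growth such as $\psi(a,b) = ab$, which is why state-evolution results of this type can track second moments and correlations of the iterates.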

Figures (6)

  • Figure 1: Cobweb plot for the state evolution in Theorem \ref{thm:main}, with two initializations: (small) $\eta_1 = 0.2$ and (large) $\eta =1$. Here, $\gamma = 1.5$, $p = 0.3$, $\alpha = 2$, $\pi_+ = 0.3$, $\pi_- = 0.7$.
  • Figure 2: Synthetic Experiments: Comparison between different retraining methods. FT and CT respectively denote the full-retraining and the consensus-based retraining without the memory correction terms. Vanilla is the estimator without any retraining. Here $n=1000$, $d = 800$, $\pi_+ = 0.3$, $\pi_-= 0.7$. Dots are the Opt-AMP algorithm and the solid black curve is the state evolution.
  • Figure 3: State evolution curves for Opt-AMP and the 'approximate' full retraining with the memory correction terms. As $\beta$ grows the approximation of full retraining becomes tighter. Here $\alpha = 0.8$, $\pi_+ = 0.3$, $\pi_-= 0.7$.
  • Figure 4: State evolution curves for Opt-AMP and the 'approximate' consensus-based retraining with the memory correction terms. As $\beta$ grows the approximation of consensus-based retraining becomes tighter. Here $\alpha = 0.8$, $\pi_+ = 0.3$, $\pi_-= 0.7$.
  • Figure 5: The AMP update mappings for optimal aggregator, full-retraining and consensus-based retraining. Here, $\gamma=1.5$, $\alpha = 2$, $\pi_+=0.3$, $\pi_- = 0.7$.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 3.1
  • Theorem 3.2
  • Proposition 3.3
  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem G.1
  • Proposition I.1