PENEX: AdaBoost-Inspired Neural Network Regularization

Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach

TL;DR

PENEX introduces a penalized exponential loss for neural networks, reformulating the multi-class exponential loss with a SumExp penalty to create a first-order-optimizable objective. The authors prove Fisher consistency and margin-maximization guarantees, and show that gradient descent on PENEX acts as an implicit AdaBoost-like procedure, effectively parameterizing weak learners. Empirically, PENEX frequently yields stronger regularization and better generalization than common techniques across computer vision and language tasks, especially in low-data and noisy-label scenarios, albeit with some convergence speed trade-offs and limited gains on very large datasets like ImageNet. The work positions PENEX as a practical AdaBoost-inspired regularizer with theoretical foundations and broad applicability to training and fine-tuning deep neural networks.

Abstract

AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes mislabeled data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods. We demonstrate both empirically and theoretically that PENEX implicitly maximizes margins of data points. Also, we show that gradient increments on PENEX implicitly parameterize weak learners in the boosting framework. Across computer vision and language tasks, we show that PENEX exhibits a regularizing effect often better than established methods with similar computational cost. Our results highlight PENEX's potential as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks.
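
The exact form of eq:PEL is not reproduced in this summary, so the following is only a minimal sketch under an assumed form: an exponential loss on the true-class logit, scaled by a sensitivity `alpha`, plus a SumExp penalty over all class logits weighted by `rho`. The function names and the default values of `alpha` and `rho` are hypothetical. A standard cross-entropy is included for comparison.

```python
import math

def penex_loss(logits, labels, alpha=1.0, rho=0.1):
    """ASSUMED form (not taken from the paper's text):
    mean over samples of exp(-alpha * f_y) + rho * sum_j exp(f_j)."""
    total = 0.0
    for f, y in zip(logits, labels):
        exp_term = math.exp(-alpha * f[y])       # exponential loss on the true-class logit
        penalty = sum(math.exp(fj) for fj in f)  # SumExp penalty over all class logits
        total += exp_term + rho * penalty
    return total / len(labels)

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy, computed with a stable log-sum-exp."""
    total = 0.0
    for f, y in zip(logits, labels):
        m = max(f)
        lse = m + math.log(sum(math.exp(fj - m) for fj in f))
        total += lse - f[y]
    return total / len(labels)

# Binary example: the true class is index 0 and the second logit is fixed to 0.
for f1 in (-4.0, 0.0, 4.0):
    batch, labels = [[f1, 0.0]], [0]
    print(f1, penex_loss(batch, labels), cross_entropy(batch, labels))
```

Under this assumed form, a strongly negative true-class logit is penalized exponentially, whereas cross-entropy grows only linearly. The penalty term in turn keeps the logits from growing without bound.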

Paper Structure

This paper contains 67 sections, 4 theorems, 56 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Proposition 2.1

For any $\rho > 0$ and $\alpha > 0$, the penalized exponential loss (eq:PEL) is Fisher consistent: the minimizer $f_*$ of its population equivalent (obtained by replacing $\hat{\mathbb{E}}$ with $\mathbb{E}$ in eq:PEL) recovers the Bayes-optimal classifier, i.e., the $f_*^{(y)}(\mathbf{x})$ recover the true logits (up to constant shift) at temperature $(1 +

Figures (7)

  • Figure 1: Gradient Descent on PENEX as a Form of Implicit AdaBoost. AdaBoost (left) builds a strong learner $f_M(\mathbf{x})$ (purple) by sequentially fitting weak learners such as decision stumps (orange) and linearly combining them. Gradient descent itself (right) can be thought of as an implicit form of boosting where weak learners correspond to $\mathbf{J} (\mathbf{x}) \Delta \theta_{m}$ (orange), parameterized by parameter increments $\Delta \theta_{m}$. Combining many gradient descent steps can thus be interpreted as forming a strong learner $f_{\theta_{M}}(\mathbf{x})$ (green) as an approximate linear combination of weak learners. In both cases, each weak learner is obtained by minimizing the exponential loss within a fixed function class $\mathcal{G}$ (more details in sec:rel_to_adaboost).
  • Figure 2: Comparison of Margins. Neural networks trained with PENEX (center) tend to implicitly maximize margins (here, geometric margins, indicated for an example point in green), in a similar way to support vector machines (right), here trained with an RBF kernel (SchSmo02). Training neural networks via cross-entropy loss (left), in contrast, typically leads to smaller margins.
  • Figure 3: CE vs. PENEX. We consider the binary case ($K=2$) with $f^{(2)}(x) \equiv 0$, for a single $x$ and $y=1$. PENEX penalizes errors more than cross-entropy.
  • Figure 4: Performance Analysis on CIFAR-100. Larger means better. Results are computed from validation data. All hyperparameters have been tuned individually. PENEX (thick red) is an effective regularizer with often better generalization than other common regularization techniques (thin), and shows no signs of "overfitting" like cross-entropy training (orange).
  • Figure 5: All Validation Curves. Larger means better. Validation curves over $200$ epochs for all experiments, similar to fig:metric_curves.
  • ...and 2 more figures
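
The weak-learner view in the Figure 1 caption rests on a first-order expansion: $f_{\theta + \Delta\theta}(\mathbf{x}) \approx f_{\theta}(\mathbf{x}) + \mathbf{J}(\mathbf{x}) \Delta\theta$. The sketch below illustrates only that linearization on a tiny hypothetical scalar model. The model, its parameter values, and the increment `delta` are assumptions chosen for illustration; `delta` is an arbitrary small step, not a gradient step on PENEX.

```python
import math

def model(theta, x):
    """Tiny hypothetical scalar network: f_theta(x) = w2 * tanh(w1 * x + b)."""
    w1, b, w2 = theta
    return w2 * math.tanh(w1 * x + b)

def jacobian(theta, x):
    """Analytic Jacobian of the model output with respect to (w1, b, w2)."""
    w1, b, w2 = theta
    h = math.tanh(w1 * x + b)
    dh = 1.0 - h * h  # derivative of tanh at the pre-activation
    return [w2 * dh * x, w2 * dh, h]

def weak_learner(theta, delta, x):
    """First-order term J(x) @ delta, the 'weak learner' induced by a parameter increment."""
    return sum(j * d for j, d in zip(jacobian(theta, x), delta))

theta = [0.5, -0.2, 1.5]      # hypothetical current parameters
delta = [1e-3, -2e-3, 5e-4]   # hypothetical small parameter increment
x = 0.7

exact = model([t + d for t, d in zip(theta, delta)], x)
linear = model(theta, x) + weak_learner(theta, delta, x)
print(abs(exact - linear))    # second-order residual of the linearization
```

The residual shrinks quadratically with the size of the increment, which is why a sequence of small steps can be read as an approximate linear combination of such terms.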

Theorems & Definitions (7)

  • Proposition 2.1: Fisher Consistency
  • Theorem 2.1: Margin Maximization
  • Proposition 2.2: Optimal Penalty Parameter
  • Proposition 2.3
  • proof
  • proof
  • proof