EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification

Ben Dai

TL;DR

EnsLoss is a novel ensemble method that extends the ensemble learning concept to combine loss functions within the empirical risk minimization (ERM) framework. It first transforms the convexity and calibration (CC) conditions on losses into conditions on loss-derivatives, which bypasses the need for explicit loss functions and allows calibrated loss-derivatives to be generated directly.

Abstract

Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is that it preserves the "legitimacy" of the combined losses, i.e., it ensures the CC properties. Specifically, we first transform the CC conditions on losses into conditions on loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Then, inspired by Dropout, EnsLoss enables loss ensembles through a single training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed loss methods is demonstrated through experiments on a broad range of 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. A Python repository and source code are available on GitHub at https://github.com/statmlben/ensloss.
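The "doubly stochastic" idea in the abstract (a random minibatch plus a random calibrated loss-derivative at each step) can be illustrated with a minimal sketch. This is not the paper's algorithm: instead of generating loss-derivatives directly, it draws one from a small fixed pool of well-known convex losses whose derivative at 0 is negative. The names `LOSS_DERIVATIVES`, `make_data`, `train`, and `accuracy`, the toy data, and all hyperparameters are assumptions made for illustration.

```python
import math
import random

# Derivatives phi'(z) of convex surrogate losses, each with phi'(0) < 0.
# Assumption: a fixed pool stands in for randomly generated loss-derivatives.
LOSS_DERIVATIVES = {
    "logistic": lambda z: -1.0 / (1.0 + math.exp(min(z, 50.0))),
    "hinge": lambda z: -1.0 if z < 1.0 else 0.0,  # a subgradient at z = 1
    "squared_hinge": lambda z: -2.0 * max(0.0, 1.0 - z),
}

def make_data(n, rng):
    """Two Gaussian blobs in 2-D with labels in {-1, +1} (toy data)."""
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        x = (rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0))
        data.append((x, y))
    return data

def train(data, epochs=30, batch_size=16, lr=0.05, seed=0):
    """Linear classifier trained with a random loss-derivative per minibatch."""
    rng = random.Random(seed)
    data = list(data)  # avoid shuffling the caller's list
    names = list(LOSS_DERIVATIVES)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # randomness 1: minibatch composition
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            dphi = LOSS_DERIVATIVES[rng.choice(names)]  # randomness 2: loss
            gw, gb = [0.0, 0.0], 0.0
            for x, y in batch:
                z = y * (w[0] * x[0] + w[1] * x[1] + b)  # margin y * f(x)
                g = dphi(z) * y  # chain rule: d phi(y f) / d f
                gw[0] += g * x[0]
                gw[1] += g * x[1]
                gb += g
            w[0] -= lr * gw[0] / len(batch)
            w[1] -= lr * gw[1] / len(batch)
            b -= lr * gb / len(batch)
    return w, b

def accuracy(data, w, b):
    """Fraction of points whose margin y * f(x) is positive."""
    hits = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1] + b) > 0)
    return hits / len(data)
```

Only the derivative of the loss enters the update, which is why a method of this kind never needs the loss value itself. For example, `train(make_data(400, random.Random(1)))` returns the fitted weights and bias of the toy classifier.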

Paper Structure (31 sections, 10 theorems, 37 equations, 5 figures, 7 tables, 2 algorithms)

Introduction
Overfitting.
Related works
Dropout.
Penalization methods.
Classification-calibration.
Post loss ensembles.
Loss Meta-learn.
Calibrated loss ensembles via doubly stochastic gradients
Calibrated loss-derivative
Superlinear raising-tail
Statistical behavior and consistency of loss ensembles
Experiments
Datasets, models and losses
Tabular datasets.
...and 16 more sections

Key Result

Theorem 2

Let $\phi$ be convex. Then $\phi$ is classification-calibrated if and only if it is differentiable at 0 and $\phi'(0) < 0$.
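As a rough numerical illustration of the condition in Theorem 2, the sketch below estimates the one-sided derivatives of a few convex margin losses at 0 and checks that they agree and are negative. The finite-difference step, the tolerance, and the helper names are assumptions; the `perceptron` loss $\max(0, -z)$ is included as a convex loss with a kink at 0, so the check should reject it.

```python
import math

# Margin-based surrogate losses phi(z), with z = y * f(x).
LOSSES = {
    "logistic": lambda z: math.log1p(math.exp(-z)),
    "exponential": lambda z: math.exp(-z),
    "hinge": lambda z: max(0.0, 1.0 - z),
    "squared": lambda z: (1.0 - z) ** 2,
    # Convex, but with a kink at 0: not differentiable there.
    "perceptron": lambda z: max(0.0, -z),
}

def one_sided_derivatives(phi, z0=0.0, h=1e-6):
    """Finite-difference estimates of the left and right derivatives at z0."""
    left = (phi(z0) - phi(z0 - h)) / h
    right = (phi(z0 + h) - phi(z0)) / h
    return left, right

def satisfies_theorem2(phi, tol=1e-4):
    """Numerical proxy for: phi is differentiable at 0 and phi'(0) < 0."""
    left, right = one_sided_derivatives(phi)
    differentiable = abs(left - right) < tol
    return differentiable and right < 0.0

if __name__ == "__main__":
    for name, phi in LOSSES.items():
        print(name, satisfies_theorem2(phi))
```

The check is only a numerical proxy for the theorem's condition; it assumes convexity of each candidate loss rather than verifying it.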

Figures (5)

Figure 1: Comparison of test accuracy versus epoch for various models on the CIFAR2 (cat-dog) dataset using EnsLoss (ours) and other fixed losses (logistic, hinge, and exponential losses). The training accuracy curves are omitted, as they largely stabilize at 1 after a few epochs. The pattern shown in the figure, where EnsLoss consistently outperforms the fixed losses across epochs, is observed in almost all CIFAR10 label-pairs and the PCam dataset, as well as with different scales of ResNet, MobileNet, and VGG architectures.
Figure 2: The overall motivation behind generating valid loss-derivatives in our algorithm: first transform the loss conditions (left) into loss-derivative conditions (middle), thereby bypassing the loss and directly generating random loss-derivatives in SGD-based algorithms (right).
Figure 3: Left. Plot of several existing loss functions. Right. Corresponding loss-gradients when $z>1$. Conclusion. Lemma 3 (superlinear raising-tail) essentially indicates that the right tail of the loss-derivatives needs to rise rapidly from $\phi'(0) < 0$ towards zero, either surpassing zero (as in the case of squared loss) or vanishing faster than $1/z$ when $z$ is large (ignoring the logarithm).
Figure 4: The overall pattern of performance (accuracy) of EnsLoss against all other fixed loss methods on 45 CIFAR2 binary classification datasets (formed from pairwise label subsets of CIFAR10), based on VGG16. The $x$-axis represents label-paired binary CIFAR datasets, where, for example, CIFAR35 corresponds to the CIFAR2 (cat-dog) dataset.
Figure 5: Training curves for neural networks (epochs vs. training accuracy) under different loss functions, with each experiment replicated five times. Left. An MLP network with five hidden layers was trained on a simulated dataset, where all loss functions were proved to be calibrated. Right. A ResNet18 model was trained on the CIFAR (cat and dog) dataset. Conclusion. Losses that fail to meet the superlinear raising-tail condition, even when classification-calibrated, lead to instability in SGD training.
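The tail behavior described in the Figure 3 caption can be probed numerically. The sketch below checks, on a small grid of large margins, whether a loss-derivative either reaches a non-negative value or satisfies $z\,\phi'(z) \approx 0$ (a crude proxy for "vanishing faster than $1/z$", ignoring the logarithmic factor). The grid, the tolerance, the function name `tail_ok`, and the `slow_tail` derivative $-1/(1+z)$ are hypothetical choices for illustration, not taken from the paper.

```python
import math

# Loss-derivatives phi'(z) evaluated on the right tail (z >= 0).
DERIVATIVES = {
    "logistic": lambda z: -1.0 / (1.0 + math.exp(z)),
    "exponential": lambda z: -math.exp(-z),
    "hinge": lambda z: -1.0 if z < 1.0 else 0.0,
    "squared": lambda z: -2.0 * (1.0 - z),
    # Hypothetical derivative that decays only like 1/z (defined for z >= 0).
    "slow_tail": lambda z: -1.0 / (1.0 + z),
}

def tail_ok(dphi, z_grid=(10.0, 100.0, 500.0), tol=1e-3):
    """Crude check of the raising-tail pattern on a finite grid.

    Returns True if phi' becomes non-negative somewhere on the grid,
    or if |z * phi'(z)| is below tol at the largest grid point.
    """
    reaches_zero = any(dphi(z) >= 0.0 for z in z_grid)
    z_max = max(z_grid)
    vanishes_fast = abs(z_max * dphi(z_max)) < tol
    return reaches_zero or vanishes_fast

if __name__ == "__main__":
    for name, dphi in DERIVATIVES.items():
        print(name, tail_ok(dphi))
```

A finite grid cannot establish an asymptotic statement, so this is only a sanity check: it separates exponentially decaying tails (logistic, exponential) and derivatives that hit zero (hinge, squared) from a tail that decays at exactly the $1/z$ rate.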

Theorems & Definitions (11)

Definition 1 (bartlett2006convexity)
Theorem 2 (zhang2004multiclass; bartlett2006convexity)
Lemma 3: Superlinear raising-tail
Lemma 4
Lemma 5
Theorem 6: Calibration
Lemma 7
Lemma 8
Corollary 9
Lemma 10
...and 1 more
