Learning with Confidence: Training Better Classifiers from Soft Labels

Sjoerd de Vries, Dirk Thierens

TL;DR

This study investigates whether incorporating label uncertainty, represented for each instance as a discrete probability distribution over the class labels, known as a soft label, improves the predictive performance of classification models, focusing on tabular data.

Abstract

In supervised machine learning, models are typically trained using data with hard labels, i.e., definite assignments of class membership. This traditional approach, however, does not take the inherent uncertainty in these labels into account. We investigate whether incorporating label uncertainty, represented as discrete probability distributions over the class labels -- known as soft labels -- improves the predictive performance of classification models. We first demonstrate the potential value of soft label learning (SLL) for estimating model parameters in a simulation experiment, particularly for limited sample sizes and imbalanced data. Subsequently, we compare the performance of various wrapper methods for learning from both hard and soft labels using identical base classifiers. On real-world-inspired synthetic data with clean labels, the SLL methods consistently outperform hard label methods. Since real-world data is often noisy and precise soft labels are challenging to obtain, we study the effect that noisy probability estimates have on model performance. Alongside conventional noise models, our study examines four types of miscalibration that are known to affect human annotators. The results show that SLL methods outperform the hard label methods in the majority of settings. Finally, we evaluate the methods on a real-world dataset with confidence scores, where the SLL methods are shown to match the traditional methods for predicting the (noisy) hard labels while providing more accurate confidence estimates.
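As a rough illustration of what learning from soft labels can look like, the sketch below fits a one-dimensional logistic regression by gradient descent on the cross-entropy loss, once against soft labels and once against hard labels sampled from them. It is a minimal example under stated assumptions, not the wrapper methods evaluated in the paper: the ground-truth model `sigmoid(2x - 1)`, the sample size, and the helper `fit_logistic` are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, targets, lr=0.5, epochs=3000):
    """Fit 1-D logistic regression by gradient descent on cross-entropy.

    `targets` may be hard labels (0 or 1) or soft labels in [0, 1];
    the gradient (p - t) * x has the same form in both cases.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - t for x, t in zip(xs, targets)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical ground truth (an assumption): p(C_1 | x) = sigmoid(2x - 1).
TRUE_W, TRUE_B = 2.0, -1.0

rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(30)]

# Clean soft labels: the true class-one probability of each instance.
soft = [sigmoid(TRUE_W * x + TRUE_B) for x in xs]
# Hard labels: one class sampled per instance from its soft label.
hard = [1 if rng.random() < p else 0 for p in soft]

w_soft, b_soft = fit_logistic(xs, soft)
w_hard, b_hard = fit_logistic(xs, hard)

print(f"soft labels: w={w_soft:.3f}, b={b_soft:.3f}")
print(f"hard labels: w={w_hard:.3f}, b={b_hard:.3f}")
```

With noiseless soft labels, the cross-entropy minimum coincides with the generating parameters, so the soft-label fit approaches them even on a small sample. The hard-label fit also carries the sampling variance of the labels, which is the intuition behind the parameter-estimation experiment described in the abstract.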

Paper Structure (33 sections, 4 equations, 23 figures, 4 tables)

Figures (23)

  • Figure 1: $\Delta \overline{MSE}$ (soft, hard) for different numbers of samples taken from the true class distributions and for different values of the prior probability of class one, $p(C_1)$. Shown without noise (a) and with noise (b) added to the soft labels.
  • Figure 2: The four miscalibration noise models, defined by Equation \ref{eq:miscalibration}, for $\beta = 0.3$.
  • Figure 3: Heat map illustrating the performance of various methods with SGD as base classifier across multiple datasets, along with their mean performance over all datasets, measured by the AUC on $y^G$. All values were multiplied by $100$ to enhance readability. Red cells indicate higher AUC values, while blue cells represent lower values, relative to the AUC of the PluralityBootstrapClf for each dataset.
  • Figure 4: Heat map illustrating the performance of various methods using four base classifiers, averaged over all datasets. Performance is measured by the AUC on $y^G$ and $\overline{TVD}$ on $y^{PG}$. The $\overline{TVD}$ values were multiplied by $-1$ to allow for easier comparison with the AUC. All values were multiplied by $100$ to enhance readability. Red cells indicate better performance, while blue cells indicate worse performance, than PluralityBootstrapClf for each combination of base classifier and metric.
  • Figure 5: The effect of six different noise types on method performance with four different base classifiers, measured by the AUC on the ground truth test data, across multiple noise levels. Noise types include NCAR, NAR, overprediction, underprediction, underextremity and overextremity, with noise levels ranging from level 0 (noiseless) to level 6 (noise strength 0.3). LR, SGD, GNB and DT were used as base classifiers. RF served as the ground truth model, with the soft labels generated at the high uncertainty level.
  • ...and 18 more figures
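
The captions above report $\overline{TVD}$ without defining it. Assuming it denotes the mean total variation distance between predicted class distributions and reference soft labels (an assumption, since the definition is not part of this excerpt), a minimal sketch of the computation could look as follows. The function names and the example values are hypothetical.

```python
def total_variation_distance(p, q):
    """Total variation distance between two discrete distributions
    defined over the same ordered set of classes."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same classes")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def mean_tvd(predicted, reference):
    """Average the per-instance TVD over a dataset."""
    if len(predicted) != len(reference):
        raise ValueError("need one reference distribution per prediction")
    distances = [total_variation_distance(p, q)
                 for p, q in zip(predicted, reference)]
    return sum(distances) / len(distances)


# Hypothetical two-class example with two instances.
predicted = [[0.7, 0.3], [0.2, 0.8]]
reference = [[0.9, 0.1], [0.2, 0.8]]
score = mean_tvd(predicted, reference)
print(f"mean TVD: {score:.3f}")
```

Lower values indicate predicted distributions that are closer to the reference soft labels, which is why Figure 4 negates the metric before placing it next to the AUC.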