You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

Nabeel Seedat; Nicolas Huynh; Fergus Imrie; Mihaela van der Schaar

TL;DR

Pseudo-labeling relies on labeled data, but real-world labels are often noisy. The authors propose DIPS, a data-centric framework that uses learning dynamics to characterize labeled and pseudo-labeled samples and to train only on the useful ones, via a two-threshold mechanism on average confidence and aleatoric uncertainty. DIPS is plug-and-play with any pseudo-labeling method, model-agnostic, and computationally cheap. It delivers consistent gains across tabular and image datasets, improves data efficiency, and reduces performance gaps among pseudo-labelers. The work advocates a practical data-centric shift in semi-supervised learning, showing substantial benefits when label quality is acknowledged and exploited through training dynamics.

Abstract

Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overlooked aspect and show the importance of investigating labeled data quality to improve any pseudo-labeling method. Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling. We select useful labeled and pseudo-labeled samples via analysis of learning dynamics. We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world tabular and image datasets. Additionally, DIPS improves data efficiency and reduces the performance distinctions between different pseudo-labelers. Overall, we highlight the significant benefits of a data-centric rethinking of pseudo-labeling in real-world settings.

Paper Structure (48 sections, 3 equations, 25 figures, 5 tables, 2 algorithms)

Introduction
Related work
Background
Semi-supervised learning via pseudo-labeling
Overlooked aspects in the current formulation of pseudo-labeling
DIPS: Data-centric insights for improved pseudo-labeling
A data-centric formulation of pseudo-labeling
Operationalizing DIPS using learning dynamics
Defining the selector $r$: data characterization and selection
Combining DIPS with any pseudo-labeling algorithm
Experiments
Synthetic example: Data characterization and unlabeled data improve test accuracy
DIPS improves different pseudo-labeling algorithms across 12 real-world tabular datasets.
DIPS improves data efficiency
DIPS improves performance of cross-country pseudo-labeling
...and 33 more sections

Figures (25)

Figure 1: (Left) Current pseudo-labeling formulations implicitly assume that the labeled data is the gold standard. (Right) However, this assumption is violated in real-world settings. Mislabeled samples lead to error propagation when pseudo-labeling the unlabeled data.
Figure 2: Stage 1 operationalizes DIPS by leveraging learning dynamics of individual labeled and pseudo-labeled samples to characterize them as Useful or Harmful. Only Useful samples are then kept for Stage 2, which consists of pseudo-labeling, using any off-the-shelf method.
Figure 3: (a)-(b) The colored dots illustrate the selected labeled and pseudo-labeled samples for the last iteration of PL and PL+DIPS, with $30\%$ label noise. Grey dots are unselected unlabeled samples. (c) Characterizing and selecting data for the semi-supervised algorithm yields the best results (epitomized by PL+DIPS) and makes the unlabeled data impactful. (d) Characterizing and selecting data via DIPS outperforms other data-centric mechanisms.
Figure 4: DIPS consistently improves the performance of all five pseudo-labeling methods across the 12 real-world datasets. DIPS also reduces the performance gap between the different pseudo-labelers.
Figure 5: DIPS (pink) improves data efficiency of vanilla methods (green), achieving the same level of performance with 60-70% fewer labeled examples, as shown by the vertical dotted lines. The results for (a) Pseudo-labeling and (b) UPS are averaged across datasets and show gains in accuracy vs. the maximum performance of the vanilla method. Additionally, DIPS selection generally provides additional efficiency gains over other possible selection mechanisms.
...and 20 more figures

Theorems & Definitions (2)

Definition 4.1: Average confidence
Definition 4.2: Aleatoric uncertainty
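
The two definitions above feed the two-threshold selector mentioned in the TL;DR. This page does not give their formulas, so the following is only a minimal sketch under stated assumptions: average confidence is taken as the mean probability assigned to a sample's (pseudo-)label across training checkpoints, and aleatoric uncertainty as the mean of p(1 - p) across the same checkpoints. The function names, the array layout, and the threshold values `tau_conf` and `tau_al` are hypothetical choices made for illustration, not values from the paper.

```python
import numpy as np

def learning_dynamics_metrics(probs, labels):
    """Per-sample metrics from checkpointed predictions (assumed definitions).

    probs:  array of shape (E, N, C), class probabilities at E checkpoints.
    labels: array of shape (N,), integer labels or pseudo-labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = labels.shape[0]
    # Probability assigned to each sample's given label, per checkpoint: (E, N).
    p_label = probs[:, np.arange(n), labels]
    avg_confidence = p_label.mean(axis=0)
    aleatoric = (p_label * (1.0 - p_label)).mean(axis=0)
    return avg_confidence, aleatoric

def select_useful(avg_confidence, aleatoric, tau_conf=0.8, tau_al=0.2):
    """Two-threshold selector: keep confident, low-uncertainty samples.

    tau_conf and tau_al are illustrative defaults, not the paper's settings.
    """
    return (avg_confidence >= tau_conf) & (aleatoric < tau_al)

# Toy example: 3 checkpoints, 3 samples, 2 classes.
probs = np.array([
    [[0.90, 0.10], [0.5, 0.5], [0.10, 0.90]],
    [[0.95, 0.05], [0.5, 0.5], [0.05, 0.95]],
    [[0.99, 0.01], [0.5, 0.5], [0.02, 0.98]],
])
labels = np.array([0, 1, 0])  # the third label disagrees with the model throughout

conf, alea = learning_dynamics_metrics(probs, labels)
mask = select_useful(conf, alea)
print(mask)
```

In this toy run the first sample is kept, the second is rejected for high uncertainty (its label probability stays at 0.5), and the third is rejected for low confidence, which is the behavior one would expect for a likely mislabeled point.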
