Intra-Cluster Mixup: An Effective Data Augmentation Technique for Complementary-Label Learning

Tan-Ha Mai, Hsuan-Tien Lin

TL;DR

This work tackles complementary-label learning (CLL), where training relies on labels indicating classes to which an instance does not belong. It identifies Mixup as unsuitable for CLL due to complementary-label noise and introduces Intra-Cluster Mixup (ICM), which performs in-cluster data augmentation by clustering embeddings (via SimSiam) and mixing samples within the same cluster. ICM is integrated with surrogate complementary losses to form a new training paradigm that reduces noise and improves generalization, achieving substantial gains on MNIST, CIFAR, and real-world CLCIFAR datasets, including notable improvements under imbalanced conditions. The approach benefits real-world CLL applications by enabling more accurate and reliable models with cheaper supervision.

Abstract

In this paper, we investigate the challenges of complementary-label learning (CLL), a specialized form of weakly-supervised learning (WSL) where models are trained with labels indicating classes to which instances do not belong, rather than standard ordinary labels. This alternative supervision is appealing because collecting complementary labels is generally cheaper and less labor-intensive. Although most existing research in CLL emphasizes the development of novel loss functions, the potential of data augmentation in this domain remains largely underexplored. In this work, we find that the widely used Mixup data augmentation technique is ineffective when directly applied to CLL. Through in-depth analysis, we identify that the complementary-label noise generated by Mixup negatively impacts the performance of CLL models. We then propose an improved technique called Intra-Cluster Mixup (ICM), which only synthesizes augmented data from nearby examples, to mitigate the noise effect. ICM encourages nearby examples to share complementary labels and leads to substantial performance improvements across synthetic and real-world labeled datasets. In particular, our wide spectrum of experimental results on both balanced and imbalanced CLL settings demonstrates the potential of combining ICM with state-of-the-art CLL algorithms, achieving significant accuracy increases of 30% and 10% on MNIST and CIFAR datasets, respectively.
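The complementary-label noise attributed to Mixup can be illustrated with a small sketch. It assumes one plausible reading of that noise: a mixed pair is noisy when the complementary label of one example equals the ordinary label of the other, so the mixed label wrongly says "not this class" about a class that is partly present. The function name `mixup_noise_rate` and the toy arrays are hypothetical and do not come from the paper.

```python
import numpy as np

def mixup_noise_rate(y_true, y_comp, partner):
    """Fraction of mixed pairs (i, partner[i]) flagged as noisy.

    Assumption (not taken from the paper's definitions): a pair is noisy
    when the complementary label of one example is the ordinary label of
    the other example in the pair.
    """
    y_true = np.asarray(y_true)
    y_comp = np.asarray(y_comp)
    partner = np.asarray(partner)
    noisy = (y_comp == y_true[partner]) | (y_comp[partner] == y_true)
    return float(noisy.mean())

# Toy data: four examples, three classes; each complementary label
# differs from the example's own ordinary label.
y_true = [0, 0, 1, 1]
y_comp = [1, 2, 0, 2]

same_class_pairs = [1, 0, 3, 2]   # partners share the ordinary label
cross_class_pairs = [2, 3, 0, 1]  # partners have different ordinary labels

rate_same = mixup_noise_rate(y_true, y_comp, same_class_pairs)
rate_cross = mixup_noise_rate(y_true, y_comp, cross_class_pairs)
```

In this toy setup, pairs drawn from the same ordinary class never trigger the noise condition, while cross-class pairs can. That is the intuition behind restricting Mixup to nearby examples.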

Paper Structure

This paper contains 31 sections, 1 theorem, 18 equations, 16 figures, 9 tables.

Key Result

Proposition 1

For Mixup-generated pairs $(\tilde{\mathbf{x}}_{i,j}, \tilde{y}_{i,j})$, the complementary classification risk under Mixup, $\mathcal{R}'(g;\ell)$, admits a decomposition in terms of $\varepsilon_i$ and $\varepsilon_j$, the local noise errors, which satisfy $\varepsilon(g) = \frac{1}{N}\sum_{i=1}^N \varepsilon_i.$ Thus, the Mixup risk $\mathcal{R}'(g;\ell)$ consists of two classification-er

Figures (16)

  • Figure 1: Illustration of the Intra-Cluster Mixup (ICM) framework. Top: Embedding features are extracted using a pretrained SimSiam encoder and clustered using $k$-means, aiming to group samples with similar ordinary labels. Bottom right: Within each cluster, ICM generates synthetic samples by interpolating features and labels, which are then used to train the classifier.
  • Figure 2: Analysis of the impact of noise and Mixup Noise-Free (NF) on complementary-label learning performance.
  • Figure 3: ICM training with cluster-consistent Mixup. Lines 1–3: extract SimSiam embeddings and assign $k$ clusters. Lines 4–12: synthesize $(\tilde{\mathbf{x}},\tilde{y})$ by interpolating pairs within the same cluster using Eqs. (3)–(4). Lines 13–14: update $\theta$ on the synthetic batch.
  • Figure 4: Comparison of the noise ratio across datasets (left) and the test accuracy of Mixup and ICM for different algorithms on CIFAR10 (right).
  • Figure 5: Comparison of p-values for the difference between the Mixup and ICM methods on CIFAR10 and CLCIFAR10 with the S-NL (right) and FWD (left) algorithms in both balanced and imbalanced ($\rho=100$) scenarios.
  • ...and 11 more figures
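
The steps named in the Figure 1 and Figure 3 captions (cluster the embeddings, then interpolate pairs within the same cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cluster assignments are taken as given (for example, from $k$-means on SimSiam embeddings computed elsewhere), the mixing weight is drawn from a Beta distribution as in standard Mixup, and the function name `intra_cluster_mixup` and its parameters are hypothetical.

```python
import numpy as np

def intra_cluster_mixup(x, y_comp, cluster_ids, num_classes, alpha=1.0, seed=0):
    """Mix each example with a partner drawn from its own cluster.

    Assumptions: cluster_ids were computed beforehand, and lambda follows
    Beta(alpha, alpha) as in standard Mixup. The paper's exact
    interpolation equations are not reproduced here.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y_comp = np.asarray(y_comp)
    cluster_ids = np.asarray(cluster_ids)

    # One-hot encode complementary labels so they can be interpolated.
    y_onehot = np.eye(num_classes)[y_comp]

    # Shuffle indices within each cluster to pick same-cluster partners.
    partner = np.arange(len(x))
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        partner[members] = rng.permutation(members)

    # One mixing weight per example, broadcast over the feature dimensions.
    lam = rng.beta(alpha, alpha, size=len(x))
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))

    x_mix = lam_x * x + (1.0 - lam_x) * x[partner]
    y_mix = lam[:, None] * y_onehot + (1.0 - lam[:, None]) * y_onehot[partner]
    return x_mix, y_mix, partner

# Toy usage: six 2-dimensional examples split into two clusters.
x = np.arange(12).reshape(6, 2)
y_comp = [0, 1, 2, 0, 1, 2]
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
x_mix, y_mix, partner = intra_cluster_mixup(x, y_comp, cluster_ids, num_classes=3)
```

The synthetic batch `(x_mix, y_mix)` would then be passed to whichever complementary loss is in use. Because partners are permuted within a cluster, an example may occasionally be paired with itself, which leaves it unchanged.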

Theorems & Definitions (5)

  • Definition 1: Complementary classification error
  • Definition 2: Error generated by label noise
  • Proposition 1: Complementary error with Mixup
  • proof
  • proof