Pairwise Similarity Distribution Clustering for Noisy Label Learning

Sihan Bai

TL;DR

This work tackles learning with noisy labels by introducing Pairwise Similarity Distribution Clustering (PSDC), which leverages per-class pairwise feature structure to separate clean and noisy samples via a two-component Gaussian Mixture Model on aggregated affinity scores. The method rests on a theoretical foundation (submerged-condition and Lyapunov-based analyses) that explains when affinity-based partitioning can reliably distinguish clean from noisy data, independent of direct label information. Empirically, PSDC improves data partitioning and, when combined with semi-supervised learning like MixMatch and contrastive learning, yields state-of-the-art or competitive results on CIFAR-10/100 and Clothing1M across various noise regimes. The approach offers a robust, scalable alternative to loss-based or label-correction strategies, with practical impact for training deep models in settings with substantial label noise.

Abstract

Noisy label learning aims to train deep neural networks on large numbers of samples with noisy labels; its main challenge is handling the inaccurate supervision caused by wrong labels. Existing works adopt either the label correction or the sample selection paradigm to bring more accurately labeled samples into training. In this paper, we propose a simple yet effective sample selection algorithm, termed Pairwise Similarity Distribution Clustering (PSDC), that divides the training samples into a clean set and a noisy set, which can power any off-the-shelf semi-supervised learning regime to further train networks for different downstream tasks. Specifically, we use the pairwise similarity between sample pairs to represent the sample structure and a Gaussian Mixture Model (GMM) to model the similarity distribution between sample pairs belonging to the same noisy cluster, so that each sample can be confidently assigned to the clean set or the noisy set. Even under severe label noise rates, the resulting data partition mechanism is shown, in both theory and practice, to be more robust in judging label confidence. Experimental results on benchmark datasets such as CIFAR-10, CIFAR-100 and Clothing1M demonstrate significant improvements over state-of-the-art methods.
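
The partition step described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes features for one label group are already extracted, uses the row mean of the cosine-similarity matrix as the per-sample affinity score, fits a two-component 1-D Gaussian mixture with plain EM, and assumes that the component with the higher mean affinity corresponds to clean samples. The function names (`row_mean_affinity`, `fit_gmm_1d`, `psdc_partition`) and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def row_mean_affinity(features):
    """Mean similarity of each sample to the other samples in its label group."""
    n = len(features)
    scores = []
    for p in range(n):
        total = sum(cosine(features[p], features[q]) for q in range(n) if q != p)
        scores.append(total / max(n - 1, 1))
    return scores


def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, mu, var, w):
    """Posterior probability of each of the two components for every score."""
    out = []
    for xi in x:
        dens = [w[k] * gaussian_pdf(xi, mu[k], var[k]) for k in range(2)]
        total = sum(dens)
        out.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
    return out


def fit_gmm_1d(x, iters=100, floor=1e-6):
    """EM for a two-component 1-D Gaussian mixture, initialised at min and max."""
    lo, hi = min(x), max(x)
    mu = [lo, hi]
    var = [max(((hi - lo) / 4) ** 2, floor)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        resp = responsibilities(x, mu, var, w)  # E-step
        for k in range(2):                      # M-step
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(sum(r[k] * (xi - mu[k]) ** 2 for r, xi in zip(resp, x)) / nk, floor)
            w[k] = nk / len(x)
    return mu, var, w


def psdc_partition(features, threshold=0.5):
    """Return one flag per sample: True if judged clean, False if judged noisy."""
    scores = row_mean_affinity(features)
    mu, var, w = fit_gmm_1d(scores)
    clean_k = 0 if mu[0] > mu[1] else 1  # assumption: higher affinity means clean
    resp = responsibilities(scores, mu, var, w)
    return [r[clean_k] > threshold for r in resp]
```

Calling `psdc_partition` once per label group, with the feature vectors of all samples carrying that label, yields the clean/noisy split for that group. In practice a library mixture model would likely replace the hand-written EM loop; the stdlib version is used here only to keep the sketch self-contained.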

Paper Structure

This paper contains 13 sections, 2 theorems, 12 equations, 4 figures, 6 tables, 2 algorithms.

Key Result

theorem 1

Consider two pairs of samples, $\{(x_p,\tilde{y}),(x_q,\tilde{y})\}$, randomly selected from $\mathbb{G}_i$, with their respective indices in the affinity matrix $A^i$ being $p$ and $q$. Given the following conditions: Then the mean value of row $p$ in the affinity matrix $A^i$ follows a Gaussian distribution with mean $\mu_p$ and the mean value of row $q$ in the affinity matrix $A^i$ follows a Ga

Figures (4)

  • Figure 1: Illustration of sample selection through pairwise similarity distribution clustering. For each group with the same label, we first calculate the cosine distance between all sample pairs, then summarize the distribution matrix by row and divide all the samples into two groups using a Gaussian Mixture Model.
  • Figure 2: Examples of sample selection by our PSDC on the Clothing1M, CIFAR-100, and CIFAR-10 datasets. Each image shows its assigned label at the bottom, with clean labels bordered in green and noisy labels bordered in red. The selection result is indicated by a checkmark (judged clean) or a cross (judged noisy); a red checkmark or cross means that PSDC partitioned that sample incorrectly.
  • Figure 3: Illustration of the semi-supervised training framework. At each time $t$, the current network is first used to extract features for all training samples. These features are then used to divide the training samples into a clean set and a noisy set with our PSDC algorithm. Finally, the sample selection results power the semi-supervised learning regime. Once the network is updated at time $t+1$, it is used to conduct sample selection in a new round. As more and more samples are correctly divided into the clean set and the noisy set, the network in turn becomes stronger through semi-supervised training.
  • Figure 4: Accuracy of the clean sets obtained by different methods on the CIFAR-100 dataset with 50% symmetric noise, where samples are clustered using a GMM based on extracted features, cross-entropy loss, and pairwise similarity measures, respectively.
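
The alternating procedure in the Figure 3 caption can be outlined as a short loop. This is a schematic sketch under stated assumptions, not the paper's training code: `extract`, `partition`, and `ssl_update` are hypothetical callables supplied by the caller, and samples judged noisy are assumed to be passed to the semi-supervised step without their labels, as is common in MixMatch-style pipelines.

```python
def alternate_selection_and_training(model, samples, extract, partition, ssl_update, rounds=3):
    """Alternate sample selection and semi-supervised updates for a fixed number of rounds.

    Hypothetical callables (placeholders, not a real library API):
      extract(model, samples)              -> list of feature vectors
      partition(features, labels)          -> list of bools, True = judged clean
      ssl_update(model, labeled, unlabeled) -> updated model
    `samples` is a list of (input, noisy_label) pairs.
    """
    clean_counts = []
    for _ in range(rounds):
        # Time t: the current network extracts features for all training samples.
        features = extract(model, samples)
        labels = [y for _, y in samples]
        # The features drive the clean/noisy split.
        is_clean = partition(features, labels)
        labeled = [s for s, c in zip(samples, is_clean) if c]
        # Assumption: noisy samples keep their inputs but drop their labels.
        unlabeled = [x for (x, _), c in zip(samples, is_clean) if not c]
        # Time t+1: the selection result powers the semi-supervised update.
        model = ssl_update(model, labeled, unlabeled)
        clean_counts.append(sum(is_clean))
    return model, clean_counts
```

Passing the partition routine as a callable keeps the loop independent of any particular selection criterion, so a similarity-based split and a loss-based split can be swapped for comparison.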

Theorems & Definitions (4)

  • definition 1
  • definition 2
  • theorem 1
  • theorem 2