Combating Semantic Contamination in Learning with Label Noise

Wenxiao Fan, Kan Li

TL;DR

The paper identifies Semantic Contamination as a fundamental failure mode in learning with noisy labels and shows that standard label refurbishment methods fail to preserve robust semantic structure when combining views or models. It introduces Collaborative Cross Learning, which decouples semantic concepts from class labels (SDCL) and enforces embedding alignment across models (EIA) via a joint loss that combines cross-view and cross-model terms with confidence-based sample selection. Empirical results across CIFAR benchmarks and real-world noisy datasets demonstrate strong improvements over state-of-the-art methods, with ablations confirming the necessity of both SDCL and EIA components. The approach yields more coherent semantic representations and reduces the impact of label noise, offering a practical pathway for robust learning under noisy supervision.
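The notion of a contaminated semantic pair can be made concrete with a small check on class embeddings. The sketch below is only an illustration, not the paper's evaluation protocol: the function names (`cosine`, `contaminated_pairs`), the toy prototype vectors, and the use of cosine similarity are all assumptions introduced here. It flags a triple (anchor, near, far) whenever the embedding space ranks the supposedly distant class above the supposedly close one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contaminated_pairs(prototypes, expected_closer):
    """Return the (anchor, near, far) triples that the embeddings rank the wrong way.

    prototypes: dict mapping class name -> embedding vector (hypothetical input).
    expected_closer: triples where `near` should be more similar to `anchor` than `far`.
    """
    flagged = []
    for anchor, near, far in expected_closer:
        sim_near = cosine(prototypes[anchor], prototypes[near])
        sim_far = cosine(prototypes[anchor], prototypes[far])
        if sim_far > sim_near:
            flagged.append((anchor, near, far))
    return flagged


# Toy prototypes with made-up values, purely for illustration.
clean_space = {"cat": [1.0, 0.2, 0.0], "dog": [0.9, 0.3, 0.1], "airplane": [0.0, 0.1, 1.0]}
noisy_space = {"cat": [0.5, 0.1, 0.8], "dog": [1.0, 0.3, 0.0], "airplane": [0.1, 0.1, 1.0]}
triples = [("cat", "dog", "airplane")]

print(contaminated_pairs(clean_space, triples))
print(contaminated_pairs(noisy_space, triples))
```

In the toy `noisy_space`, cat sits closer to airplane than to dog, which is the kind of association the paper calls Semantic Contamination.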

Abstract

Noisy labels can negatively impact the performance of deep neural networks. One common solution is label refurbishment, which involves reconstructing noisy labels through predictions and distributions. However, these methods may introduce problematic semantic associations, a phenomenon that we identify as Semantic Contamination. Through an analysis of Robust LR, a representative label refurbishment method, we found that utilizing the logits of views for refurbishment does not adequately balance the semantic information of individual classes. Conversely, using the logits of models fails to maintain consistent semantic relationships across models, which explains why label refurbishment methods frequently encounter issues related to Semantic Contamination. To address this issue, we propose a novel method called Collaborative Cross Learning, which utilizes semi-supervised learning on refurbished labels to extract appropriate semantic associations from embeddings across views and models. Experimental results show that our method outperforms existing approaches on both synthetic and real-world noisy datasets, effectively mitigating the impact of label noise and Semantic Contamination.
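As a rough illustration of what label refurbishment means, the sketch below blends the given (possibly noisy) one-hot label with a model's predicted distribution. This is a generic convex-combination form, assumed here for illustration; it is not necessarily the exact rule used by Robust LR or by Collaborative Cross Learning, and the names `one_hot`, `refurbish`, and `confidence` are hypothetical.

```python
def one_hot(label, num_classes):
    """One-hot encode an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def refurbish(noisy_label, prediction, confidence):
    """Blend the given label with the model prediction.

    confidence: assumed estimate in [0, 1] that the given label is clean.
    A value of 1.0 keeps the label as is; 0.0 trusts the prediction fully.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    target = one_hot(noisy_label, len(prediction))
    return [confidence * t + (1.0 - confidence) * p
            for t, p in zip(target, prediction)]


# A sample labelled as class 0 whose model prediction favours class 1.
soft_label = refurbish(noisy_label=0, prediction=[0.1, 0.7, 0.2], confidence=0.25)
print(soft_label)
```

Because the refurbished target inherits whatever class relationships the prediction encodes, any problematic associations in the logits carry over into the new label, which is the risk the abstract describes.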

Paper Structure

This paper contains 40 sections, 22 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Semantic Contamination ("Air." is short for airplane). After being trained with noisy labels, models may learn problematic semantic pairs, for example treating (cat, airplane) as more similar than (cat, dog). This can cause the model to learn an incorrect feature space, which hurts its performance. This study focuses on enabling the model to learn reasonable semantic information in order to overcome Semantic Contamination.
  • Figure 2: Illustration of Semantic Imbalance Among Classes. Our method (orange) learns more balanced representations than RoLR (blue).
  • Figure 3: Semantic consistency across models for RoLR (blue) and our method (orange) on CIFAR-10 under symmetric noise rates of 20%, 50% and 80%.
  • Figure 4: Pipeline of our method. Details of the warm-up stage and of confidence estimation with the small-loss criterion and a GMM can be found in the Appendix.
  • Figure 5: Results for different augmentation strategies on CIFAR-100 under 80% symmetric noise.
  • ...and 2 more figures
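
The Figure 4 caption mentions confidence estimation by the small-loss criterion and a GMM, with details deferred to the paper's Appendix. A common recipe for this idea, sketched below under that assumption, fits a two-component one-dimensional Gaussian mixture to per-sample losses and treats the posterior of the low-mean component as the probability that a label is clean. The function names, the EM initialisation, and the toy losses are illustrative choices, not the authors' implementation.

```python
import math


def _responsibilities(losses, weights, means, variances):
    """E-step: posterior probability of each component for every loss value."""
    resp = []
    for x in losses:
        dens = [w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        # Fall back to an uninformative posterior if both densities underflow.
        resp.append([d / total for d in dens] if total > 0.0 else [0.5, 0.5])
    return resp


def clean_probabilities(losses, iters=50):
    """Fit a 2-component 1-D GMM with EM; return P(low-loss component | loss)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]                                   # start at the extremes
    variances = [max((hi - lo) ** 2 / 4.0, 1e-6)] * 2  # broad initial spread
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = _responsibilities(losses, weights, means, variances)
        for k in range(2):                             # M-step
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)              # floor avoids collapse
            weights[k] = nk / len(losses)
    resp = _responsibilities(losses, weights, means, variances)
    clean = 0 if means[0] <= means[1] else 1           # small-loss component
    return [r[clean] for r in resp]


# Made-up per-sample losses: five small (likely clean) and three large (likely noisy).
toy_losses = [0.05, 0.10, 0.08, 0.12, 2.10, 2.40, 1.90, 0.07]
probs = clean_probabilities(toy_losses)
selected = [i for i, p in enumerate(probs) if p > 0.5]
print(selected)
```

Samples whose clean probability exceeds a threshold (0.5 in this sketch, an arbitrary choice) would be kept as confidently labelled, and the rest treated as unlabelled or refurbished.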