Combating Semantic Contamination in Learning with Label Noise
Wenxiao Fan, Kan Li
TL;DR
The paper identifies Semantic Contamination as a fundamental failure mode in learning with noisy labels and shows that standard label refurbishment methods fail to preserve robust semantic structure when combining views or models. It introduces Collaborative Cross Learning, which decouples semantic concepts from class labels (SDCL) and enforces embedding alignment across models (EIA) via a joint loss that combines cross-view and cross-model terms with confidence-based sample selection. Empirical results across CIFAR benchmarks and real-world noisy datasets demonstrate strong improvements over state-of-the-art methods, with ablations confirming the necessity of both SDCL and EIA components. The approach yields more coherent semantic representations and reduces the impact of label noise, offering a practical pathway for robust learning under noisy supervision.
Abstract
Noisy labels can negatively impact the performance of deep neural networks. One common solution is label refurbishment, which involves reconstructing noisy labels through predictions and distributions. However, these methods may introduce problematic semantic associations, a phenomenon that we identify as Semantic Contamination. Through an analysis of Robust LR, a representative label refurbishment method, we find that utilizing the logits of views for refurbishment does not adequately balance the semantic information of individual classes. Conversely, using the logits of models fails to maintain consistent semantic relationships across models, which explains why label refurbishment methods frequently encounter issues related to Semantic Contamination. To address this issue, we propose a novel method called Collaborative Cross Learning, which utilizes semi-supervised learning on refurbished labels to extract appropriate semantic associations from embeddings across views and models. Experimental results show that our method outperforms existing approaches on both synthetic and real-world noisy datasets, effectively mitigating the impact of label noise and Semantic Contamination.
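For readers unfamiliar with label refurbishment, the general idea can be sketched as a convex combination of the given (possibly noisy) one-hot label and the model's predicted class distribution. The sketch below is a generic illustration under that assumption; it is not the refurbishment rule of Robust LR or of Collaborative Cross Learning, and the names `softmax`, `refurbish_label`, and `weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refurbish_label(noisy_label, logits, weight=0.5):
    """Blend a one-hot noisy label with the model's prediction.

    `weight` is an assumed mixing coefficient in [0, 1]:
    0 keeps the given label unchanged, 1 trusts the model entirely.
    """
    probs = softmax(logits)
    return [
        (1.0 - weight) * (1.0 if k == noisy_label else 0.0) + weight * p
        for k, p in enumerate(probs)
    ]


# Example: a 3-class sample labeled as class 1, with an uninformative model.
soft = refurbish_label(noisy_label=1, logits=[0.0, 0.0, 0.0], weight=0.5)
```

Because the soft target inherits whatever class relationships the prediction encodes, a biased prediction can pass misleading associations between classes into the refurbished label, which is the kind of effect the abstract refers to as Semantic Contamination.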
