Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised Pretraining with Pseudo-Label Refinement
Gouranga Bala, Anuj Gupta, Subrat Kumar Behera, Amit Sethi
TL;DR
This work tackles the problem of instance-dependent label noise by proposing a hybrid framework that combines self-supervised pretraining via SimCLR with iterative pseudo-label refinement. The method first learns noise-robust feature representations through self-supervision and then progressively improves label quality using stage-based pseudo-labeling, loss-threshold filtering, and augmented data. Key contributions include integrating self-supervised learning with consensus-based pseudo-labeling and dynamic augmentation, along with a four-iteration refinement process that demonstrates strong robustness on CIFAR-10 and CIFAR-100 under synthetic IDN, particularly when combined with DivideMix. The results indicate that the proposed approach provides a practical, noise-robust training strategy for real-world datasets where label noise is common, with potential for further improvements through adaptive staging and newer SSL techniques.
Abstract
Deep learning models rely heavily on large volumes of labeled data to achieve high performance. However, real-world datasets often contain noisy labels due to human error, ambiguity, or resource constraints during the annotation process. Instance-dependent label noise (IDN), where the probability of a label being corrupted depends on the input features, poses a significant challenge because it is more prevalent in practice and harder to address than instance-independent noise. In this paper, we propose a novel hybrid framework that combines self-supervised learning using SimCLR with iterative pseudo-label refinement to mitigate the effects of IDN. The self-supervised pretraining phase enables the model to learn robust feature representations without relying on potentially noisy labels, establishing a noise-agnostic foundation. Subsequently, we employ an iterative training process with pseudo-label refinement, in which confidently predicted samples are identified through a multistage approach and their labels are updated to progressively improve label quality. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets corrupted with synthetic instance-dependent noise at varying noise levels. Experimental results demonstrate that our approach significantly outperforms several state-of-the-art methods, particularly under high noise conditions, achieving notable improvements in classification accuracy and robustness. Our findings suggest that integrating self-supervised learning with iterative pseudo-label refinement offers an effective strategy for training deep neural networks on datasets affected by instance-dependent label noise.
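As a rough illustration of the refinement idea summarized above, the following Python sketch shows one possible pass that combines loss-threshold filtering with confidence-based relabeling. It is not the authors' implementation: the function name `refine_labels`, the thresholds `loss_threshold` and `conf_threshold`, and the rule for combining the two criteria are all assumptions made for this example, and the paper's multistage procedure may differ.

```python
import math

def refine_labels(probs, labels, loss_threshold=0.5, conf_threshold=0.9):
    """One illustrative refinement pass (hypothetical, not the paper's code).

    probs:  list of per-sample class-probability lists (each sums to 1).
    labels: list of current, possibly noisy, integer labels.
    Returns (new_labels, clean_flags, relabeled_flags).
    """
    new_labels, clean_flags, relabeled_flags = [], [], []
    for p, y in zip(probs, labels):
        # Cross-entropy of the prediction against the current label.
        loss = -math.log(max(p[y], 1e-12))
        # Most likely class under the model and its probability.
        pred = max(range(len(p)), key=p.__getitem__)
        confidence = p[pred]

        # Assumed rule 1: small-loss samples are treated as likely clean.
        is_clean = loss < loss_threshold
        # Assumed rule 2: other samples are relabeled only if the model
        # is confident in its prediction; otherwise the label is kept.
        is_relabeled = (not is_clean) and confidence >= conf_threshold

        new_labels.append(pred if is_relabeled else y)
        clean_flags.append(is_clean)
        relabeled_flags.append(is_relabeled)
    return new_labels, clean_flags, relabeled_flags


probs = [
    [0.95, 0.03, 0.02],  # agrees with label 0: low loss, kept as clean
    [0.02, 0.96, 0.02],  # label says 0, model is confident in class 1
    [0.40, 0.35, 0.25],  # label says 2, model is unsure: left unchanged
]
noisy_labels = [0, 0, 2]
new_labels, clean_flags, relabeled_flags = refine_labels(probs, noisy_labels)
```

In an iterative scheme, a pass like this would be repeated after each round of training, with the model's predictions recomputed on the updated labels. In practice the thresholds would need tuning per dataset and noise level.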
