LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration
Siqi Wang, Bryan A. Plummer
TL;DR
This work tackles learning with noisy labels (LNL) by integrating noise-source (NS) knowledge, a setting termed LNL+K. It builds on the observation that label noise often originates from a limited set of confusable categories and can be quantified via $p(c|x_i)$ and $p(c_n|x_i)$, the probabilities that sample $x_i$ belongs to its labeled class $c$ and to a noise-source class $c_n$. The paper defines a unified clean-sample-detection framework and adapts state-of-the-art LNL methods (CRUST, FINE, SFT, UNICON, DISC) to incorporate NS knowledge, including NS knowledge estimated with DualT. Across six datasets and two noise regimes (dominant and asymmetric), LNL+K yields substantial gains, with accuracy improvements of up to 23% in dominant-noise settings and robust improvements when NS knowledge is incomplete or estimated. The study also introduces the notion of knowledge absorption rate and argues that LNL+K should be investigated directly to achieve reliable learning under real-world label noise, particularly when NS information is partial or noisy.
Abstract
Learning with noisy labels (LNL) aims to train a high-performing model using a noisy dataset. We observe that noise for a given class often comes from a limited set of categories, yet many LNL methods overlook this. For example, an image mislabeled as a cheetah is more likely a leopard than a hippopotamus, because leopards and cheetahs are visually similar. Thus, we explore Learning with Noisy Labels with noise source Knowledge integration (LNL+K), which leverages knowledge about the likely source(s) of label noise that is often provided in a dataset's meta-data. Integrating noise source knowledge boosts performance even in settings where LNL methods typically fail. For example, LNL+K methods are effective on datasets where noise represents the majority of samples, which breaks a critical premise of most methods developed for LNL. Our LNL+K methods can boost performance even when noise sources are estimated rather than extracted from meta-data. We provide several baseline LNL+K methods that integrate noise source knowledge into state-of-the-art LNL models. We evaluate them across six diverse datasets and two types of noise, where we report gains of up to 23% compared to the unadapted methods. Critically, we show that LNL methods fail to generalize on some real-world datasets, even when adapted to integrate noise source knowledge, highlighting the importance of directly exploring LNL+K.
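
As an illustration only, the comparison of $p(c|x_i)$ and $p(c_n|x_i)$ mentioned in the TL;DR can be sketched as a simple decision rule: a sample labeled $c$ is kept as clean when the model assigns more probability to $c$ than to any known noise source $c_n$ of that class. The function name `detect_clean`, the `noise_sources` mapping, and the choice to keep samples whose class has no listed noise source are assumptions made for this sketch; it is not the paper's implementation, and the probabilities are assumed to come from some already-trained classifier.

```python
import numpy as np


def detect_clean(probs, labels, noise_sources):
    """Flag samples as clean using noise-source knowledge (illustrative sketch).

    probs:         (N, C) array, probs[i, k] approximates p(k | x_i).
    labels:        (N,) array of observed, possibly noisy, labels.
    noise_sources: dict mapping a class c to the list of classes c_n
                   assumed to be its noise sources (hypothetical format).
    Returns a boolean array, True where the sample is kept as clean.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    clean = np.zeros(len(labels), dtype=bool)
    for i, c in enumerate(labels):
        sources = noise_sources.get(int(c), [])
        if not sources:
            # Assumption: with no known noise source there is nothing
            # to compare against, so the sample is kept.
            clean[i] = True
            continue
        # Keep the sample only if its labeled class is more probable
        # than every listed noise-source class.
        clean[i] = probs[i, c] > probs[i, sources].max()
    return clean


# Toy example: 0 = cheetah, 1 = leopard, 2 = hippopotamus.
# Leopard is listed as the only noise source for cheetah.
toy_probs = [
    [0.30, 0.25, 0.45],  # labeled cheetah, beats leopard: kept
    [0.35, 0.60, 0.05],  # labeled cheetah, leopard more likely: rejected
    [0.10, 0.20, 0.70],  # labeled hippopotamus, no listed source: kept
]
toy_labels = [0, 0, 2]
toy_mask = detect_clean(toy_probs, toy_labels, {0: [1]})
```

In this toy case the first sample is kept even though cheetah is not its highest-scoring class, because only the listed noise source (leopard) enters the comparison. That is the sense in which noise-source knowledge changes the detection criterion relative to a plain confidence threshold.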
