Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLM-Powered Assistance
Bo Yuan, Yulin Chen, Yin Zhang, Wei Jiang
TL;DR
This work addresses learning from noisy labels in text classification by introducing NoiseAL, a collaborative active-learning framework that couples two small models (SMs) with an LLM-based annotator. A co-prediction network formed by the SMs, guided by a dynamic-enhanced threshold, partitions the data into Consistent and Discrepant sets and then into three subsets: R (clean), P (purified), and H (hard). The LLM then generates labels for P, with demonstrations drawn from R to bolster in-context learning (ICL). The SMs are trained with a targeted loss on each subset (cross-entropy on R, reversed cross-entropy on P, and EmbMix-based regularization on H), giving the final objective L = L_R + L_P + L_H. Extensive experiments on synthetic and real-world noisy datasets show that NoiseAL consistently outperforms state-of-the-art baselines, is robust to instance-dependent noise, and makes cost-effective use of LLMs for label denoising. The approach offers a practical path to scalable, noise-robust learning by combining SM-based filtering with LLM-powered correction and demonstration-enabled ICL.
Abstract
Learning from noisy labels (LNL) is a challenge that arises in many real-world scenarios where collected training data can contain incorrect or corrupted labels. Most existing solutions identify noisy labels and adopt active learning to query human experts on them for denoising. In the era of large language models (LLMs), human effort in these methods can be reduced, but their performance still depends on accurately separating the clean samples from the noisy ones. In this paper, we propose NoiseAL, an innovative collaborative learning framework based on active learning that combines LLMs and small models (SMs) for learning from noisy labels. During collaborative training, we first adopt two SMs to form a co-prediction network and propose a dynamic-enhanced threshold strategy to divide the noisy data into different subsets. We then select clean and noisy samples from these subsets and feed them to LLMs, which act as active annotators, to rectify the noisy samples. Finally, we employ different optimization objectives to handle subsets with different degrees of label noise. Extensive experiments on synthetic and real-world noisy datasets demonstrate the superiority of our framework over state-of-the-art baselines.
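To make the subset-division step concrete, the following is a minimal sketch of how two SMs' predictions might split samples into the R, P, and H subsets. The exact criteria and the dynamic-enhanced threshold are not specified in this section, so everything here is an assumption made for illustration: the function name partition, the fixed threshold of 0.9 (standing in for the dynamic one), and the assignment rule itself.

```python
from typing import Dict, List, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def partition(
    probs_a: List[Sequence[float]],
    probs_b: List[Sequence[float]],
    noisy_labels: List[int],
    threshold: float = 0.9,  # assumption: fixed stand-in for the dynamic threshold
) -> Dict[str, List[int]]:
    """Split sample indices into R (clean), P (to purify), and H (hard).

    Assumed rule (illustrative, not taken from the paper):
      R: both models confidently agree with the given label.
      P: both models confidently agree with each other, but not with the label.
      H: everything else (the models disagree, or confidence is low).
    """
    subsets: Dict[str, List[int]] = {"R": [], "P": [], "H": []}
    for i, (pa, pb, y) in enumerate(zip(probs_a, probs_b, noisy_labels)):
        pred_a, pred_b = argmax(pa), argmax(pb)
        consistent = pred_a == pred_b
        confident = min(pa[pred_a], pb[pred_b]) >= threshold
        if consistent and confident and pred_a == y:
            subsets["R"].append(i)
        elif consistent and confident:
            subsets["P"].append(i)
        else:
            subsets["H"].append(i)
    return subsets


# Toy class-probability outputs from two hypothetical small models.
probs_a = [[0.95, 0.05], [0.03, 0.97], [0.60, 0.40], [0.92, 0.08]]
probs_b = [[0.91, 0.09], [0.08, 0.92], [0.45, 0.55], [0.55, 0.45]]
noisy_labels = [0, 0, 1, 0]

result = partition(probs_a, probs_b, noisy_labels)
print(result)
```

In this toy run, sample 1 is the kind of case that would be sent to the LLM annotator: both models confidently predict class 1 while the given label is 0. Samples 2 and 3 fall into H because the models either disagree or one of them is not confident.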
