DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, Eduard Dragut
TL;DR
Distantly supervised NER enables scalable labeling but introduces label noise that harms performance. The authors introduce DynClean, a training-dynamics-based label cleaning method that uses metrics like AUM to characterize samples and an automatic thresholding scheme to remove mislabeled distant annotations, applied as a preprocessing step to span-based NER models. Across four DS-NER datasets and multiple base models, cleaned data yields consistent F1 improvements (3.18%–8.95%), often surpassing state-of-the-art DS-NER methods and strong LLM baselines. DynClean demonstrates that improving the quality of distantly labeled data can match or exceed gains from more complex architectures while using fewer training samples, with potential applicability to other noisy-label NLP tasks.
Abstract
Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, which limits the performance of models trained on it. Most existing work attempts to solve this problem by developing intricate models that learn from the noisy labels. An alternative approach is to clean the labeled data, thus increasing the quality of the distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
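To make the idea of a training-dynamics metric concrete, the sketch below computes the Area Under the Margin (AUM) in its commonly used form: for each sample, the margin between the logit of its assigned label and the largest logit among the other classes, averaged over training epochs. This is an illustration under stated assumptions, not the authors' implementation. The function names `area_under_margin` and `keep_mask`, the array layout, and the toy numbers are all hypothetical, and the threshold is passed in by hand because the automatic threshold estimation strategy is not specified in this abstract.

```python
import numpy as np


def area_under_margin(logits, labels):
    """Average per-sample margin across epochs.

    logits: array of shape (epochs, samples, classes), recorded during training.
    labels: integer array of shape (samples,), the (possibly noisy) assigned labels.
    Returns an array of shape (samples,).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(labels.shape[0])

    # Logit of the assigned label for every epoch and sample: (epochs, samples).
    assigned = logits[:, idx, labels]

    # Largest logit among the remaining classes: mask out the assigned label first.
    masked = logits.copy()
    masked[:, idx, labels] = -np.inf
    largest_other = masked.max(axis=2)

    # Mean margin over epochs; low or negative values suggest a suspicious label.
    return (assigned - largest_other).mean(axis=0)


def keep_mask(aum, threshold):
    """Boolean mask of samples to keep. The threshold is a hypothetical,
    manually chosen value, not the automatic estimate described in the paper."""
    return np.asarray(aum) >= threshold


# Toy example (made-up numbers): 2 epochs, 3 samples, 2 classes.
toy_logits = [
    [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]],
    [[3.0, 0.0], [0.0, 2.0], [0.0, 2.0]],
]
toy_labels = [0, 1, 0]

toy_aum = area_under_margin(toy_logits, toy_labels)
print(toy_aum)                  # [ 2.5  1.5 -2. ]
print(keep_mask(toy_aum, 0.0))  # [ True  True False]
```

In this toy run the third sample is consistently scored higher for the class it was not assigned, so its AUM is negative and it would be dropped before training the downstream model on the remaining samples.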
