Training Neural Networks on Data Sources with Unknown Reliability
Alexander Capstick, Francesca Palermo, Tianyu Cui, Payam Barnaghi
TL;DR
The paper tackles learning from multi-source data with unknown reliability by introducing Loss Adapted Plasticity (LAP), a training strategy that re-weights each source's contribution using a tempering scheme driven by a history-guided reliability score $C_s$. LAP maintains a per-source temperature $T_s=f(C_s)$ and scales the gradient $g_s$ computed on data from source $s$ as $\hat{g}_s=f(C_s)g_s$, allowing early learning from all sources and gradual downweighting of unreliable ones. Across diverse datasets and noise types, LAP yields consistent improvements over standard training and many noisy-data baselines, while often requiring less compute than methods that train multiple models or rely on complicated architectures. The approach is robust to varying numbers and types of noisy sources and applies to classification, regression, and multiple data modalities, with open-source code for reproduction. This work contributes a scalable, source-aware technique for mitigating data quality issues in real-world, multi-source settings.
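The gradient scaling $\hat{g}_s=f(C_s)g_s$ can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not specify the form of $f$ or the update rule for $C_s$, so both are assumptions here. The sketch assumes that $C_s$ is non-negative and grows as a source looks less reliable, uses the hypothetical choice $f(C_s)=1/(1+C_s)$, and raises $C_s$ whenever a source's mean recent loss exceeds the mean plus `leniency` standard deviations of the other sources' mean losses. The names `temperature`, `update_scores`, `scale_gradient`, and `leniency` are illustrative only.

```python
from statistics import mean, pstdev


def temperature(c: float) -> float:
    """Hypothetical f: maps a score C_s >= 0 to a weight in (0, 1]."""
    return 1.0 / (1.0 + c)


def update_scores(scores, recent_losses, leniency=1.0):
    """Assumed update rule (not taken from the paper).

    Raise C_s by 1 when source s has a mean recent loss above
    mean + leniency * std of the other sources' mean losses;
    otherwise lower it by 1, floored at 0.
    """
    new_scores = {}
    for s, c in scores.items():
        others = [mean(v) for k, v in recent_losses.items() if k != s]
        threshold = mean(others) + leniency * pstdev(others)
        if mean(recent_losses[s]) > threshold:
            new_scores[s] = c + 1.0
        else:
            new_scores[s] = max(c - 1.0, 0.0)
    return new_scores


def scale_gradient(grad, c):
    """Apply g_hat_s = f(C_s) * g_s elementwise to a flat gradient."""
    t = temperature(c)
    return [t * g for g in grad]


# Toy example: source "c" has consistently higher loss than "a" and "b".
scores = {"a": 0.0, "b": 0.0, "c": 0.0}
recent_losses = {"a": [0.30, 0.28], "b": [0.31, 0.29], "c": [1.20, 1.10]}
for _ in range(3):
    scores = update_scores(scores, recent_losses)

# Gradient from a batch of source "c" is shrunk; "a" and "b" stay at weight 1.
scaled = scale_gradient([0.5, -1.0], scores["c"])
```

In this toy run, all sources start with a weight of 1, which corresponds to the warm-up behaviour described above, and only the source whose loss stays persistently high accumulates a score and has its gradient shrunk.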
Abstract
When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider their individual data quality. However, in many applications, sources have varied levels of reliability that can have negative effects on the performance of a neural network. A key issue is that the quality of the data from individual sources is often not known during training. Previous methods for training models in the presence of noisy data do not make use of the additional information that the source label can provide. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source's estimated reliability by using a dynamic re-weighting strategy motivated by likelihood tempering. This way, we allow training on all sources during the warm-up and reduce learning on less reliable sources during the final training stages, when models have been shown to overfit to noise. We show through diverse experiments that this can significantly improve model performance when trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.
