Training Neural Networks on Data Sources with Unknown Reliability

Alexander Capstick, Francesca Palermo, Tianyu Cui, Payam Barnaghi

TL;DR

The paper tackles learning from multi-source data with unknown reliability by introducing Loss Adapted Plasticity (LAP), a training strategy that weights each source's contribution by a temperature derived from a history-based reliability score. LAP maintains a per-source temperature derived from $C_s$ via $T_s=f(C_s)$ and scales gradients with $\hat{g}_s=f(C_s)g_s$, allowing early learning from all sources and gradual downweighting of unreliable ones. Across diverse datasets and noise types, LAP yields consistent improvements over standard training and many noisy-data baselines, while often requiring less compute than methods that train multiple models or rely on more complex architectures. The approach is robust to varying numbers and types of noisy sources and applies to classification, regression, and multiple data modalities, with open-source code for reproduction. This work contributes a scalable, source-aware technique for mitigating data quality issues in real-world, multi-source settings.
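The gradient scaling $\hat{g}_s=f(C_s)g_s$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `source_weighted_gradient`, the mean-squared-error loss, and the hand-picked `weights` array (standing in for the values $f(C_s)$) are all assumptions made for illustration.

```python
import numpy as np

def source_weighted_gradient(X, y, source_ids, theta, weights):
    """Gradient of a mean-squared-error loss in which each example's
    contribution is scaled by the weight of the source it came from.
    `weights[s]` stands in for f(C_s); how f is computed is not shown here."""
    residual = X @ theta - y                          # shape (n,)
    per_example_grad = 2.0 * residual[:, None] * X    # shape (n, d)
    w = weights[source_ids]                           # per-example source weight
    return (w[:, None] * per_example_grad).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
source_ids = np.array([0, 0, 1, 1, 2, 2])   # two examples from each of 3 sources
theta = np.zeros(2)

# Hypothetical weights: sources 0 and 1 trusted, source 2 fully down-weighted.
weights = np.array([1.0, 1.0, 0.0])
g_hat = source_weighted_gradient(X, y, source_ids, theta, weights)
```

With all weights equal to 1 the function reduces to the ordinary batch gradient, which matches the description of learning from every source early in training; setting a weight to 0 removes that source's examples from the update.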

Abstract

When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider their individual data quality. However, in many applications, sources have varied levels of reliability that can have negative effects on the performance of a neural network. A key issue is that often the quality of the data for individual sources is not known during training. Previous methods for training models in the presence of noisy data do not make use of the additional information that the source label can provide. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source's estimated reliability by using a dynamic re-weighting strategy motivated by likelihood tempering. This way, we allow training on all sources during the warm-up and reduce learning on less reliable sources during the final training stages, when it has been shown that models overfit to noise. We show through diverse experiments that this can significantly improve model performance when trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.
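The link between likelihood tempering and gradient re-weighting can be made explicit. The following is a sketch under the assumption that a per-source temperature $T_s$ is applied as an exponent on the likelihood of every example from source $s$; the symbols $\mathcal{D}_s$ (the data from source $s$) and $\theta$ (the model parameters) are introduced here for illustration.

```latex
\begin{align}
\log \prod_{s} \prod_{(x,y) \in \mathcal{D}_s} p(y \mid x, \theta)^{T_s}
  &= \sum_{s} T_s \sum_{(x,y) \in \mathcal{D}_s} \log p(y \mid x, \theta), \\
\nabla_\theta \Big[ T_s \sum_{(x,y) \in \mathcal{D}_s} \log p(y \mid x, \theta) \Big]
  &= T_s \, g_s,
  \qquad
  g_s = \nabla_\theta \sum_{(x,y) \in \mathcal{D}_s} \log p(y \mid x, \theta).
\end{align}
```

Because the temperature does not depend on $\theta$, it passes through the gradient as a constant factor. Setting $T_s = f(C_s)$ therefore gives the scaled gradient $\hat{g}_s = f(C_s)\,g_s$, and a source with $T_s$ close to zero contributes almost nothing to the update.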

Paper Structure

This paper contains 28 sections, 14 equations, 9 figures, 13 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Loss Adapted Plasticity: Here, ER refers to the empirical risk.
  • Figure 2: Comparing source noise estimation with and without source knowledge.
  • Figure 3: Visualisation of Equation \ref{eq:lap_method_equation}. Each colour represents the loss values from a single source over a small number of steps, with its density weighted by its temperature, $w(s) = f(C_s)$. This shows how sources contribute to $\mu_{s'}$ and $\sigma_{s'}^2$ as their $C_s$ changes during training and given the leniency $\lambda$. These values are synthetic and for demonstration.
  • Figure 4: Effect of the introduced parameters on training. Section \ref{sec:methods} introduces three parameters that control the effects of LAP. The gradient (equivalently, loss) contribution from a given source is multiplied by $1-d_s$ before the model is updated. Here, we show these values for each source (the different coloured lines) during model training on synthetic data (Appendix \ref{sec:toy_example}). Unless stated in the title of a given plot, the parameters of LAP were set to $H=25$, $\delta=1.0$, $\lambda=1.0$. We had $5$ sources with noise levels of $0.0$, $0.025$, $0.05$, $0.25$, and $1.0$ (a darker colour indicates a higher noise rate).
  • Figure 5: LAP results with a varied number of sources and noise levels. In \ref{fig:ptbxl_aucpr_vs_corruption}, we show the area under the precision-recall curve for standard training and for LAP on PTB-XL with label noise and simulated ECG interference noise, for $12$ total sources. In \ref{fig:cifar10n_acc_vs_corruption_presnet}, we show the accuracy on CIFAR-10N with real human labelling noise when using RRL and RRL + LAP, with $10$ sources. In both, the noise of the sources varies linearly from $25\%$ to $100\%$ for each number of noisy sources. The lines and error bands represent the mean and standard deviation, over the 5 repeats, of the maximum value reached in each repeat. These figures illustrate that LAP maintains higher performance as noise rates increase.
  • ...and 4 more figures
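The weighted statistics described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the function names, the fixed-length loss history array, and in particular the threshold rule `mean loss > mu + leniency * sigma` are assumptions, since the exact update is not given in this summary.

```python
import numpy as np

def weighted_loss_stats(loss_history, weights, target):
    """Weighted mean and variance of the recent losses from every source
    except `target`. Each source's losses are weighted by w(s) = f(C_s).
    loss_history has shape (num_sources, H); weights has shape (num_sources,)."""
    others = [s for s in range(loss_history.shape[0]) if s != target]
    values = loss_history[others].ravel()
    w = np.repeat(weights[others], loss_history.shape[1])
    mu = np.average(values, weights=w)       # raises if all weights are zero
    var = np.average((values - mu) ** 2, weights=w)
    return mu, var

def exceeds_leniency(loss_history, weights, target, leniency=1.0):
    """Assumed rule: flag a source whose mean recent loss lies more than
    `leniency` weighted standard deviations above the other sources' mean."""
    mu, var = weighted_loss_stats(loss_history, weights, target)
    return bool(loss_history[target].mean() > mu + leniency * np.sqrt(var))

# Synthetic losses for 3 sources over H = 4 steps; source 2 has higher loss.
history = np.array([
    [0.1, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.2, 0.1],
    [1.0, 1.1, 0.9, 1.0],
])
w = np.ones(3)  # all sources fully trusted at this point

flag_source_2 = exceeds_leniency(history, w, target=2)
flag_source_0 = exceeds_leniency(history, w, target=0)
```

In this synthetic example, source 2 is flagged while source 0 is not. A larger `leniency` makes the rule more tolerant of sources whose losses sit above the rest, which is consistent with the caption's description of $\lambda$ as a leniency parameter.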