LossVal: Efficient Data Valuation for Neural Networks

Tim Wibiral, Mohamed Karim Belaid, Maximilian Rabus, Ansgar Scherp

TL;DR

LossVal introduces per-sample weights into neural-network training via a self-weighting loss and a weighted optimal-transport term, formalized as $\mathrm{LossVal} = \mathcal{L}_w(y, \hat{y}) \cdot \mathrm{OT}_w(X_{train}, X_{val})^{2}$, to identify informative versus noisy data points. By embedding the weighting directly into the training objective, LossVal yields gradient signals that propagate across samples, enabling in-run data valuation without full retraining. Empirically, LossVal achieves state-of-the-art or competitive results on the OpenDataVal benchmark across classification and regression tasks, and demonstrates effective active data acquisition on crash-test data, all with favorable computational efficiency. The work highlights practical implications for data-centric ML, including robust handling of feature and label noise and scalable evaluation of data quality for large datasets and costly data acquisition scenarios.

Abstract

Assessing the importance of individual training samples is a key challenge in machine learning. Traditional approaches retrain models with and without specific samples, which is computationally expensive and ignores dependencies between data points. We introduce LossVal, an efficient data valuation method that computes importance scores during neural network training by embedding a self-weighting mechanism into loss functions like cross-entropy and mean squared error. LossVal reduces computational costs, making it suitable for large datasets and practical applications. Experiments on classification and regression tasks across multiple datasets show that LossVal effectively identifies noisy samples and is able to distinguish helpful from harmful samples. We examine the gradient calculation of LossVal to highlight its advantages. The source code is available at: https://github.com/twibiral/LossVal
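
The objective summarized above can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes mean squared error as the base loss, a squared Euclidean transport cost, weights normalized to sum to one for the transport term, and a plain Sinkhorn iteration as a stand-in for the optimal-transport solver. The function names `weighted_mse`, `sinkhorn_ot`, and `lossval`, the regularization `eps`, and the toy data are all hypothetical. The sketch only evaluates the objective for fixed weights, whereas in LossVal the weights are learned during training.

```python
import math

def weighted_mse(w, y, y_hat):
    """Self-weighted squared error: sum_i w_i * (y_i - y_hat_i)^2."""
    return sum(wi * (yi - pi) ** 2 for wi, yi, pi in zip(w, y, y_hat))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sinkhorn_ot(w, x_train, x_val, eps=0.5, iters=200):
    """Entropy-regularized OT cost from weighted training points to
    uniformly weighted validation points (assumed stand-in for OT_w)."""
    total = sum(w)
    a = [wi / total for wi in w]            # assumption: weights form a distribution
    b = [1.0 / len(x_val)] * len(x_val)     # uniform mass on validation points
    cost = [[sq_dist(xt, xv) for xv in x_val] for xt in x_train]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    n, m = len(a), len(b)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):                  # alternate row and column scaling
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is diag(u) * kernel * diag(v); return its total cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def lossval(w, y, y_hat, x_train, x_val):
    """Weighted loss multiplied by the squared weighted OT cost."""
    return weighted_mse(w, y, y_hat) * sinkhorn_ot(w, x_train, x_val) ** 2

# Toy data: the third training sample is far from the validation set
# and has a badly fitted (noisy) label.
x_train = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
x_val = [[0.0, 0.1], [0.1, 0.1]]
y = [0.0, 0.1, 5.0]
y_hat = [0.0, 0.1, 1.0]

w_uniform = [1.0, 1.0, 1.0]
w_down = [1.0, 1.0, 0.1]   # down-weight the suspicious sample

print(lossval(w_uniform, y, y_hat, x_train, x_val))
print(lossval(w_down, y, y_hat, x_train, x_val))
```

In this toy setting, lowering the weight of the outlying sample reduces both factors of the objective, which is the kind of signal a gradient-based update of the weights would follow.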

Paper Structure

This paper contains 50 sections, 12 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: F1 scores between the set of truly noisy samples and the samples identified as noisy, averaged over all datasets. Higher is better. In the lower plots, the lines of some methods are obscured because they lie almost on the x-axis (very low F1 scores).
  • Figure 2: Adding $x\%$ of data points with low importance score to the training data, averaged over all datasets. A lower curve is better.
  • Figure 3: Removing $x\%$ data points with a high importance score from the training data. A lower curve is better.
  • Figure 4: Exemplary crash test showing vehicle and occupant acceleration (top) and speed (bottom), as well as the corresponding $ROLC_p$ model.
  • Figure 5: Experimental setup for the active data acquisition.
  • ...and 5 more figures