Data Valuation with Gradient Similarity

Nathaniel J. Evans; Gordon B. Mills; Guanming Wu; Xubo Song; Shannon McWeeney

TL;DR

The paper addresses data quality challenges in large datasets by proposing DVGS, a scalable data-valuation method that uses gradient similarity between source samples and the target task during SGD. DVGS computes cosine similarities between source gradients and the target gradient trajectory, averaging over a subset of parameter values and multiple initializations to obtain robust sample values, and it scales with complexity $O\left(\frac{N_{iter}\,N_{source}}{T}\right)$. Empirically, DVGS matches or surpasses baselines in corrupted-label detection and noise quantification across ADULT, BLOG, CIFAR10, and LINCS L1000, while delivering substantial speedups over Data Shapley and DVRL and exhibiting robustness to hyperparameters. The method enables effective automated data cleaning and improved predictive performance with reduced manual intervention, including strong utility in unsupervised LINCS analyses where traditional quality metrics like APC may be limited.
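The core idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a logistic-regression model with analytically computed per-sample gradients, a single initialization, and similarities evaluated at every SGD step. The names `per_sample_grads` and `dvgs_sketch`, and all hyperparameter values, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    # Logistic-loss gradient for each sample: (sigmoid(x.w) - y) * x
    return (sigmoid(X @ w) - y)[:, None] * X

def dvgs_sketch(X_src, y_src, X_tgt, y_tgt, n_iter=200, lr=0.1, batch=32, seed=0):
    """Average cosine similarity between each source-sample gradient and the
    target mini-batch gradient along an SGD trajectory driven by the target set.
    Illustrative assumption: logistic regression, one initialization."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X_src.shape[1])
    sims = np.zeros(len(X_src))
    for _ in range(n_iter):
        # Target gradient from a random mini-batch of target samples
        idx = rng.choice(len(X_tgt), size=min(batch, len(X_tgt)), replace=False)
        g_tgt = per_sample_grads(w, X_tgt[idx], y_tgt[idx]).mean(axis=0)
        # Cosine similarity of every source-sample gradient with the target gradient
        g_src = per_sample_grads(w, X_src, y_src)
        denom = np.linalg.norm(g_src, axis=1) * np.linalg.norm(g_tgt) + 1e-12
        sims += (g_src @ g_tgt) / denom
        # The parameters follow the target gradient, not the source gradients
        w -= lr * g_tgt
    return sims / n_iter

# Synthetic demo (made-up data): flip about 20% of the source labels
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X_src = rng.normal(size=(300, 5))
y_src = (X_src @ w_true > 0).astype(float)
flipped = rng.random(300) < 0.2
y_src[flipped] = 1.0 - y_src[flipped]
X_tgt = rng.normal(size=(100, 5))
y_tgt = (X_tgt @ w_true > 0).astype(float)

values = dvgs_sketch(X_src, y_src, X_tgt, y_tgt)
print("mean value, flipped labels:", values[flipped].mean())
print("mean value, clean labels:  ", values[~flipped].mean())
```

In this toy setting, a flipped label reverses the sign of that sample's gradient, so mislabeled samples tend to receive lower values than clean ones. Averaging the output of `dvgs_sketch` over several seeds would approximate the multiple-initialization averaging described above.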

Abstract

High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient descent learning algorithm, scales well to large datasets, and performs comparably to or better than baseline valuation methods on tasks such as corrupted label discovery and noise quantification. We evaluate the DVGS method on tabular, image, and RNA expression datasets to show its effectiveness across domains. Our approach can rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.

Paper Structure (22 sections, 4 equations, 8 figures, 4 tables)

This paper contains 22 sections, 4 equations, 8 figures, 4 tables.

Background
Introduction
Data Valuation
Library of Integrated Network-Based Cellular Signatures
Related Work
Contributions
Proposed Methods
Data Valuation with Gradient Similarity
Time Complexity
Data
Dataset Corruption
Results
Label Corruption
Characterization of Sample Noise
Computational Complexity
...and 7 more sections

Figures (8)

Figure 1: We propose a data valuation method that compares each source sample to the target samples by computing the similarity of their gradients during stochastic gradient descent. In panel A, we depict a toy example of a 1-d loss landscape. Sample 1 (red) is accurately labeled (high quality), whereas sample 2 (blue) is incorrectly labeled (low quality). In panel B, we plot the similarity of each source sample gradient to the target set gradient (black solid line in panel A). Panel C shows the marginal distribution of gradient similarities, which is averaged to obtain the final source sample data value. To make this process tractable, gradient similarities are computed over a limited number of model parameter values during traditional stochastic gradient descent. The computed gradients are visualized by dotted lines in panels A, B, and C ($w_0$, $w_1$, ..., $w_3$). To choose the relevant values of $\theta$, we use stochastic gradient descent (SGD), with gradients calculated from the target set.
Figure 2: Evaluation of each data valuation method's ability to identify corrupted labels. The gray dashed "random" line shows theoretical random performance, whereas the blue/cyan "random" line shows empirically measured random values.
Figure 3: Evaluation of each data valuation method's effect on model performance when filtering either high-value data (dashed lines) or low-value data (solid lines).
Figure 4: Evaluation of each data valuation method's effect on model performance when filtering either high-value data (dashed lines) or low-value data (solid lines). The y-axis measures model performance using the AUROC metric.
Figure 5: (a-b) The reconstruction performance ($R^2$) of autoencoders applied to the LINCS L1000 data when filtering low- and high-value data. (c-d) DVGS data values compared to APC values.
...and 3 more figures
