Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach

Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song

TL;DR

This work tackles data-centric robustness under domain shift by introducing Eigen-Value (EV), a plug-and-play data valuation term that captures OOD risk through eigenvalue shifts of the data covariance derived from ID data. EV uses perturbation theory to approximate how removing a single point alters the top and bottom eigenvalues, enabling estimation of a point's marginal contribution to domain discrepancy without requiring OOD samples. By bounding domain discrepancy with the ratio of Hessian eigenvalues and expressing it in terms of the data covariance, EV provides a scalable, model-agnostic component that can be added to existing ID-based valuations. Empirical results across diverse vision and language datasets show that EV improves OOD robustness and stabilizes value rankings with minimal computational overhead, making it practical for large-scale data markets and continual learning. Overall, EV shifts the focus from model-centric to data-centric OOD robustness by leveraging spectral properties to guide informative data selection and pricing under domain shift.

Abstract

Data valuation has become central in the era of data-centric AI. By assigning a numeric value to each data point, it drives efficient training pipelines and enables objective pricing in data markets. Most existing data valuation methods estimate the effect of removing individual data points by measuring changes in model validation performance under in-distribution (ID) settings, rather than out-of-distribution (OOD) scenarios where data follow different patterns. Because ID and OOD data behave differently, valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. OOD-aware methods exist, but their heavy computational cost hinders practical deployment. To address these challenges, we introduce Eigen-Value (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of domain discrepancy, defined as the gap in loss between ID and OOD data, using ratios of eigenvalues of the ID data's covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. EV plugs into ID loss-based methods by adding an EV term, without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
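The perturbation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the covariance is the uncentered second-moment matrix of unit-normalized embeddings, $\Sigma = \frac{1}{n}\sum_i x_i x_i^\top$, so that removing point $k$ gives $\Delta_k = \Sigma_{-k} - \Sigma = (\Sigma - x_k x_k^\top)/(n-1)$, and it uses the standard first-order result $\lambda(\Sigma + \Delta_k) \approx \lambda(\Sigma) + u^\top \Delta_k u$ for an eigenpair $(\lambda, u)$ of $\Sigma$. The function names `first_order_loo_shift` and `exact_loo_shift` are hypothetical.

```python
import numpy as np


def first_order_loo_shift(X):
    """First-order estimate of how the largest and smallest eigenvalues
    change when each row of X is removed (leave-one-out).

    Assumption: covariance is the uncentered second moment X^T X / n.
    Returns two arrays of length n: (shift in lambda_max, shift in lambda_min).
    """
    n = X.shape[0]
    sigma = X.T @ X / n
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    lam_min, lam_max = eigvals[0], eigvals[-1]
    u_min, u_max = eigvecs[:, 0], eigvecs[:, -1]
    # u^T Delta_k u = (lambda - (u^T x_k)^2) / (n - 1) under the assumption above.
    shift_max = (lam_max - (X @ u_max) ** 2) / (n - 1)
    shift_min = (lam_min - (X @ u_min) ** 2) / (n - 1)
    return shift_max, shift_min


def exact_loo_shift(X, k):
    """Exact change in the extreme eigenvalues after removing row k,
    computed by a full re-decomposition (one eigendecomposition per point)."""
    n = X.shape[0]
    X_rest = np.delete(X, k, axis=0)
    ev_full = np.linalg.eigvalsh(X.T @ X / n)
    ev_rest = np.linalg.eigvalsh(X_rest.T @ X_rest / (n - 1))
    return ev_rest[-1] - ev_full[-1], ev_rest[0] - ev_full[0]


if __name__ == "__main__":
    # Synthetic stand-in for normalized embeddings (not data from the paper).
    rng = np.random.default_rng(0)
    scales = np.array([3.0, 2.0, 1.5, 1.0, 0.8, 0.6, 0.4, 0.2])
    X = rng.normal(size=(500, 8)) * scales
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    approx_max, approx_min = first_order_loo_shift(X)
    exact_max, exact_min = exact_loo_shift(X, k=0)
    print("lambda_max shift (approx, exact):", approx_max[0], exact_max)
    print("lambda_min shift (approx, exact):", approx_min[0], exact_min)
```

Under these assumptions, a single eigendecomposition yields estimates for all $n$ points at once, whereas the exact route needs one decomposition per removed point. How EV combines these shifts into eigenvalue ratios and a final valuation term is specified in the paper and is not reproduced here.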

Paper Structure

This paper contains 40 sections, 3 theorems, 34 equations, 9 figures, 3 tables.

Key Result

Proposition 1

We assume $\mathcal{L}_{\textnormal{ID}}\le\mathcal{L}_{\textnormal{OOD}}$ under the NCE loss function. Then, the domain discrepancy $\Gamma(\mathcal{D}_{\textnormal{OOD}}, \mathcal{D}_{\textnormal{ID}})$ is bounded as follows:

Figures (9)

  • Figure 1: Overview of EV. EV estimates the change in covariance eigenvalues induced by removing a single normalized embedding to quantify domain discrepancy; this estimate is then integrated into ID loss-based data valuation for improved OOD robustness.
  • Figure 2: PCA visualization of normalized embeddings sampled (1K each) from different domain sources. The two distributions, corresponding to different domains, partially overlap due to normalization, illustrating that the matching marginal assumption remains applicable in real-world scenarios.
  • Figure 3: Relation between the approximated values $u^\top_\text{max}\Delta_ku_\text{max}$ from the perturbation-term equation and the true values $\lambda_\text{max}(\Sigma_{-k}) - \lambda_\text{max}(\Sigma_\textnormal{ID})$ for the CIFAR-10, ImageNet, Amazon Reviews - Books and DomainNet - Real embedding datasets. The eigenvalue differences are accurately approximated by the proposed method, which shows that it captures spectral variations effectively.
  • Figure 4: Performance comparison on the OOD dataset when the highest-valued data points from the remaining set are added. The hatched bars represent the performance of other methods when EV is applied. Adding EV improves performance and enhances robustness to OOD data. This shows how selecting data with our valuation approach can guide data inclusion in continual or online learning scenarios, where identifying the most beneficial data is crucial.
  • Figure 5: Stability under training-set perturbations. We conduct a valuation on 300 CIFAR-10 samples five times, keeping 290 fixed and resampling 10. Deviation’s data-value ranking exhibits a standard deviation comparable to random selection, whereas EV yields stable, efficient rankings while retaining ID-based valuation and strong OOD performance.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Theorem 1
  • Theorem 2