LiveVal: Time-aware Data Valuation via Adaptive Reference Points

Jie Xu, Zihan Wu, Cong Wang, Xiaohua Jia

TL;DR

LiveVal tackles the challenge of time-aware data valuation during training by shifting evaluation from loss to parameter space. It seamlessly integrates with SGD, using adaptive reference points and normalization to compare data contributions across training stages. The framework provides theoretical guarantees on directional alignment, boundedness, and stability, and it demonstrates a 180× speedup over retraining-based baselines while preserving detection performance. Empirically, LiveVal detects harmful samples early across diverse modalities and scales to large models, making real-time data valuation practical for streaming data and large-scale systems.

Abstract

Time-aware data valuation enhances training efficiency and model robustness, as early detection of harmful samples could prevent months of wasted computation. However, existing methods rely on model retraining or convergence assumptions, or fail to capture long-term training dynamics. We propose LiveVal, an efficient time-aware data valuation method with three key designs: 1) seamless integration with SGD training for efficient data contribution monitoring; 2) reference-based valuation with normalization for reliable benchmark establishment; and 3) adaptive reference point selection for real-time updating with optimized memory usage. We establish theoretical guarantees for LiveVal's stability and prove that its valuations are bounded and directionally aligned with optimization progress. Extensive experiments demonstrate that LiveVal provides efficient data valuation across different modalities and model scales, achieving a 180× speedup over traditional methods while maintaining robust detection performance.

Paper Structure

This paper contains 47 sections, 4 theorems, 47 equations, 3 figures, 5 tables, 4 algorithms.

Key Result

Theorem 1

For any data point $i$ and iteration $t$, the step-wise data value $v_i^t$ of LiveVal satisfies: 1) Directional Alignment: If the gradient update from data point $i$ moves the model parameters closer to the reference point $\boldsymbol{\theta}_{\text{ref}}^t$, then $v_i^t \geq 0$. 2) Value Boundedness: …
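
The directional-alignment property can be illustrated with a small numerical sketch. The exact definition of $v_i^t$ is not reproduced in this overview, so the formula below is an assumption made purely for illustration: it scores a sample by how much its own SGD step reduces the distance to the reference point, divided by the step length. The function name `stepwise_value`, the learning rate `lr`, and the toy vectors are all hypothetical and should not be read as LiveVal's actual valuation rule.

```python
import numpy as np


def stepwise_value(theta, grad_i, theta_ref, lr, eps=1e-12):
    """Hypothetical step-wise value (an assumed form, not the paper's definition).

    Measures the reduction in distance to the reference point produced by
    the SGD step of a single sample, normalized by the step length.
    """
    step = lr * grad_i
    dist_before = np.linalg.norm(theta - theta_ref)
    dist_after = np.linalg.norm(theta - step - theta_ref)
    # By the reverse triangle inequality, |dist_before - dist_after| <= ||step||,
    # so this ratio stays within [-1, 1].
    return (dist_before - dist_after) / (np.linalg.norm(step) + eps)


theta = np.array([1.0, 1.0])      # current parameters (toy values)
theta_ref = np.array([0.0, 0.0])  # reference point (toy values)

# A gradient whose descent step points toward the reference point.
helpful = stepwise_value(theta, np.array([1.0, 1.0]), theta_ref, lr=0.1)
# A gradient whose descent step points away from the reference point.
harmful = stepwise_value(theta, np.array([-1.0, -1.0]), theta_ref, lr=0.1)

print(f"helpful sample value: {helpful:.3f}")
print(f"harmful sample value: {harmful:.3f}")
```

Under this assumed form, a sample whose update moves the parameters toward the reference point receives a non-negative value, a sample that moves them away receives a negative one, and the normalization keeps every value in $[-1, 1]$, which makes scores comparable across training stages with different step sizes.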

Figures (3)

  • Figure 1: Basic method using the final model parameter as the static reference point.
  • Figure 2: LiveVal's adaptive reference point mechanism.
  • Figure 3: Early Detection Performance of LiveVal. The curves show the number of identified corrupted samples in the first 5 training epochs for different corruption levels ($k=10,20,30,40$).

Theorems & Definitions (12)

  • Definition 1: Mini-batch SGD
  • Definition 2: Parameter Trajectory
  • Definition 3: Influence Function (Koh & Liang, 2017)
  • Definition 4: Step-wise Data Value
  • Definition 5: Cumulative Data Value
  • Theorem 1: Fundamental Properties of LiveVal
  • Theorem 2: Local Volatility Bound
  • Theorem 5: Fundamental Properties of LiveVal
  • proof
  • Definition 6: Data Valuation Volatility
  • ...and 2 more