Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Manish Nagaraj, Deepak Ravikumar, Kaushik Roy

TL;DR

This work introduces Correlation of Loss Differences (CLD), a scalable, gradient-free coreset construction method for deep learning. CLD ranks training samples by the Pearson correlation between each sample’s loss trajectory and the validation-loss trajectory, enabling class-balanced coresets that generalize well while avoiding gradient, Hessian, or embedding computations. The authors provide a convergence framework showing that high-CLD coresets achieve population-risk convergence close to full-data training, with an excess error governed by the alignment parameter $\kappa$ and validation representativeness $\delta$. Empirically, CLD matches or surpasses state-of-the-art baselines on CIFAR-100 and ImageNet-1k across subset sizes, transfers effectively across architectures, and remains stable under checkpoint subsampling, while enabling substantial reductions in compute and storage. The method also exhibits intrinsic bias reduction via per-class validation alignment and offers a solid foundation for extending principled data selection to broader supervised settings with robust validation design.

Abstract

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
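The scoring step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes that "loss differences" means differences between consecutive checkpoints, that the validation trajectory is the mean validation loss of the sample's own class, and that a sample with a constant loss trajectory gets a score of 0. The function names `cld_scores` and `select_coreset` and the toy numbers are hypothetical.

```python
import numpy as np

def cld_scores(train_losses, val_losses):
    """Pearson correlation between each sample's loss differences and the
    validation loss differences (assumed: differences of consecutive checkpoints).

    train_losses: (N, T) per-sample losses at T checkpoints
    val_losses:   (T,)   validation loss at the same checkpoints
    """
    d_train = np.diff(np.asarray(train_losses, dtype=float), axis=1)  # (N, T-1)
    d_val = np.diff(np.asarray(val_losses, dtype=float))              # (T-1,)
    d_train = d_train - d_train.mean(axis=1, keepdims=True)
    d_val = d_val - d_val.mean()
    num = d_train @ d_val
    den = np.linalg.norm(d_train, axis=1) * np.linalg.norm(d_val)
    # Assumption: a zero-variance trajectory has undefined correlation; score it 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def select_coreset(train_losses, train_labels, val_losses_per_class, fraction):
    """Keep the top `fraction` of each class, scored against that class's
    validation trajectory (hypothetical class-balanced selection rule)."""
    train_losses = np.asarray(train_losses, dtype=float)
    train_labels = np.asarray(train_labels)
    selected = []
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        scores = cld_scores(train_losses[idx], val_losses_per_class[c])
        k = max(1, int(round(fraction * len(idx))))
        selected.append(idx[np.argsort(-scores, kind="stable")[:k]])
    return np.sort(np.concatenate(selected))

# Toy trajectories (made-up numbers, five checkpoints).
val = np.array([2.0, 1.5, 1.2, 1.0, 0.9])
aligned = val + 0.2          # moves with the validation loss
opposed = 2.0 - val          # moves against it
flat = np.ones_like(val)     # never changes
scores = cld_scores(np.stack([aligned, opposed, flat]), val)

coreset = select_coreset(
    np.stack([aligned, opposed, opposed, aligned]),
    np.array([0, 0, 1, 1]),
    np.stack([val, val]),    # one validation trajectory per class
    fraction=0.5,
)
print(scores, coreset)
```

Only stored per-sample loss values enter the computation, which is why no gradient, Hessian, or embedding pass is needed at selection time.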

Paper Structure

This paper contains 78 sections, 5 theorems, 110 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Consider a gradient descent algorithm trained over $T$ iterations on a training dataset $\mathbf{S}$ with a held-out validation set $\mathbf{V}$. Given the assumptions of $L$-smoothness, bounded gradients, and validation representativeness, let the learning rate satisfy $0<\eta\le 1/L$. Let $\theta_{\mathbf{C}}^t$ denote [...] guarantees that [...] where $R_{\inf} := \inf_{\theta} R_{\mathbf{D}}(\theta)$, and $\kappa \ge 0$ is [...]

Figures (8)

  • Figure 1: Correlation of Loss Differences (CLD) at a glance. Left: ImageNet-1k "Tiger" samples illustrating varying CLD scores. High-CLD samples (top row) closely track the validation loss trajectory, indicating informative and representative data. Low/negative-CLD samples (bottom row) deviate significantly, typically corresponding to atypical, ambiguous, or mislabeled examples. Right: Performance comparison of coresets formed by selecting equal-sized subsets of the highest 10% positive, lowest 10% negative, and 10% zero-valued CLD samples of ImageNet-1k on ResNet-18. Coresets with high-positive CLD samples achieve superior accuracy across seeds.
  • Figure 2: Test accuracy (mean over five seeds) for representative coreset selection methods on CIFAR-100 and ImageNet-1k with ResNet-18. $\mathtt{CLD}$ consistently matches or outperforms baselines across dataset sizes. (X-axis uses a non-uniform coreset-size grid.) Color map: blues (score-based), oranges (optimization-based), greens (training-property-based), black ($\mathtt{CLD}$). For reference, the full-data mean top-1 accuracy over five seeds is 70.95 on CIFAR-100 and 69.91 on ImageNet-1k. Complete numerical results are available in the appendix.
  • Figure 3: Transferability of $\mathtt{CLD}$-based coresets across architectures on ImageNet-1k. Each subplot reports test accuracy (mean over five runs) for target models trained on coresets of varying sizes. Transfer coresets (dashed black, diamonds) are selected using ResNet-18; Oracle coresets (solid green, circles) are computed by the target itself. Transferred coresets are within $1\%$ of oracle coresets across all targets and sizes.
  • Figure 4: Efficiency summary: accuracy vs. compute (x-axis, log scale); bubble size is proportional to the selection-stage storage overhead. The plot uses an illustrative setup for concreteness: selecting 10% coresets of ImageNet-1k with a ResNet-18 proxy and training a ResNet-50 on the coreset (see the appendix). Both $\mathtt{CLD}_{90}$ (all proxy epochs) and $\mathtt{CLD}_{45}$ (first 45 proxy epochs) are shown; the latter achieves similar accuracy at roughly half the selection compute. A similar trend is observed for DUAL when restricted to early proxy epochs (which we discuss in the Discussion section), highlighting that early-epoch scoring can improve efficiency without harming performance.
  • Figure 5: Stability and bias. (a) $\mathtt{CLD}$ is stable under reduced temporal resolution; (b) external stratified sampling is unnecessary, and often harmful, for $\mathtt{CLD}$.
  • ...and 3 more figures
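
The early-epoch and reduced-resolution variants mentioned in the captions for Figures 4 and 5 amount to scoring on a subset of the stored checkpoints. The sketch below shows one way to check how much a selection changes under such subsampling. It is an illustration only: the trajectories are synthetic, so the printed overlaps say nothing about the paper's results, and the helper names `cld` and `top_k_overlap`, the stride of 3, and the use of consecutive-checkpoint differences are assumptions.

```python
import numpy as np

def cld(train_losses, val_losses):
    """Pearson correlation of consecutive-checkpoint loss differences
    (assumed reading of CLD) for each row of train_losses."""
    d_tr = np.diff(train_losses, axis=1)
    d_va = np.diff(val_losses)
    d_tr = d_tr - d_tr.mean(axis=1, keepdims=True)
    d_va = d_va - d_va.mean()
    den = np.linalg.norm(d_tr, axis=1) * np.linalg.norm(d_va)
    return (d_tr @ d_va) / np.maximum(den, 1e-12)  # guard against zero variance

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k samples under scores_a that are also top-k under scores_b."""
    top_a = np.argsort(-scores_a)[:k]
    top_b = np.argsort(-scores_b)[:k]
    return len(np.intersect1d(top_a, top_b)) / k

# Synthetic loss trajectories: 200 samples, 90 checkpoints (made-up data).
rng = np.random.default_rng(0)
N, T = 200, 90
val = np.exp(-np.arange(T) / 20.0) + 0.05 * rng.standard_normal(T)
gain = rng.uniform(-1.0, 1.0, size=(N, 1))   # how strongly each sample follows val
train = gain * val + 0.05 * rng.standard_normal((N, T))

full = cld(train, val)                        # all 90 checkpoints
early = cld(train[:, :45], val[:45])          # first 45 checkpoints only
strided = cld(train[:, ::3], val[::3])        # every third checkpoint

k = N // 10                                   # a 10% coreset
print("early vs full overlap:  ", top_k_overlap(full, early, k))
print("strided vs full overlap:", top_k_overlap(full, strided, k))
```

Because scoring reads only the stored loss matrix, dropping checkpoints reduces both the proxy epochs that must be run and the number of loss values that must be kept.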

Theorems & Definitions (20)

  • Definition 1: Correlation of Loss Differences ($\mathtt{CLD}$)
  • Remark 1: Per-Class Validation Trajectories
  • Theorem 1: Convergence with $\mathtt{CLD}$-Coresets
  • Proof: Proof Sketch
  • Corollary 1: Necessity of High $\mathtt{CLD}$ for Good Coresets
  • Lemma 1: High $\mathtt{CLD}$ Implies Gradient Alignment
  • Proof: Proof Outline
  • Proof
  • Remark 2: On Update Sequence Variation
  • Lemma 2: Stability of Gradient Alignment
  • ...and 10 more