Finding the Muses: Identifying Coresets through Loss Trajectories

Manish Nagaraj, Deepak Ravikumar, Efstathia Soufleri, Kaushik Roy

TL;DR

This work introduces Loss Trajectory Correlation (LTC), a scalable metric that correlates training-sample loss trajectories with validation-loss trajectories to identify influential data for constructing coresets. By leveraging loss dynamics captured during standard training and using Pearson correlation, LTC enables architecture-agnostic core selection with low computational and storage overhead. Empirical results on CIFAR-100 and ImageNet-1k show LTC matches or surpasses state-of-the-art coreset methods across core sizes and transfers effectively across diverse architectures (e.g., ResNet, VGG, DenseNet, Swin Transformer) with minimal degradation. In addition to efficient coreset construction, LTC provides insights into training dynamics, distinguishing aligned and conflicting sample behaviors and supporting broader dataset optimization efforts.

Abstract

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Loss Trajectory Correlation (LTC), a novel metric for coreset selection that identifies critical training samples driving generalization. LTC quantifies the alignment between training sample loss trajectories and validation set loss trajectories, enabling the construction of compact, representative subsets. Unlike traditional methods with computational and storage overheads that are infeasible to scale to large datasets, LTC achieves superior efficiency as it can be computed as a byproduct of training. Our results on CIFAR-100 and ImageNet-1k show that LTC consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods, with any differences remaining under 1%. LTC also effectively transfers across various architectures, including ResNet, VGG, DenseNet, and Swin Transformer, with minimal performance degradation (<2%). Additionally, LTC offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors, at a fraction of the computational cost of traditional methods. This framework paves the way for scalable coreset selection and efficient dataset optimization.
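The core idea, correlating per-sample training loss trajectories with validation loss trajectories via Pearson correlation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The array layout (one loss value per sample per epoch), the averaging over validation samples, the top-k selection rule, and the names `pearson_rows`, `ltc_scores`, and `select_coreset` are all assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np


def pearson_rows(a, b, eps=1e-12):
    """Pearson correlation between every row of `a` and every row of `b`.

    a: (n_a, T) and b: (n_b, T) loss trajectories over T epochs.
    Returns an (n_a, n_b) matrix. Constant trajectories get correlation 0.
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


def ltc_scores(train_losses, val_losses):
    # Assumption: one score per training sample, obtained by averaging
    # its correlation with every validation trajectory.
    return pearson_rows(train_losses, val_losses).mean(axis=1)


def select_coreset(train_losses, val_losses, k):
    # Assumption: keep the k training samples with the highest score.
    scores = ltc_scores(train_losses, val_losses)
    return np.argsort(-scores)[:k]


# Toy trajectories over 10 epochs (synthetic, for illustration only).
epochs = np.arange(10)
val = np.exp(-0.3 * epochs)[None, :]      # validation loss decreases
train = np.stack([
    np.exp(-0.5 * epochs),                # decreases with validation: aligned
    1.0 - np.exp(-0.5 * epochs),          # increases as validation falls: conflicting
    np.full(10, 0.5),                     # flat: carries no correlation signal
])

scores = ltc_scores(train, val)
print(scores)                         # positive, negative, and zero score
print(select_coreset(train, val, 2))  # indices of the two highest-scoring samples
```

In this toy setup, the first training trajectory falls together with the validation loss and receives a positive score, while the second rises as the validation loss falls and receives a negative score. Because the trajectories are simply losses logged during an ordinary training run, a score of this kind needs only an array of shape (samples, epochs) and no gradient or Hessian computation.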

Paper Structure

This paper contains 42 sections, 37 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of loss trajectories of train and query sample pairs with high positive and high negative $\mathtt{LTC}$: (a) shows high positive $\mathtt{LTC}$, where train (solid blue) and query (dashed green) sample losses decrease together, indicating aligned learning dynamics. (b) shows high negative $\mathtt{LTC}$, where reductions in the train sample’s loss correspond to increases in the query sample’s loss, highlighting conflicting relationships. These examples demonstrate how $\mathtt{LTC}$ captures inter-sample influences during training. (Best viewed in color.)
  • Figure 2: Examples of randomly chosen query samples from the test set in ImageNet-1k and their corresponding training samples within the same class with the highest positive $\mathtt{LTC}$ (closely aligned learning dynamics) and highest negative $\mathtt{LTC}$ (conflicting learning dynamics). This visualization highlights the alignment and contrast between query samples and training samples in terms of their feature relevance during the learning process. The $\mathtt{LTC}$ values were calculated using ResNet-18.
  • Figure 3: Comparison of test accuracy across various coreset selection methods on the CIFAR-100 and ImageNet-1k datasets. The proposed approach consistently outperforms or matches existing techniques for all evaluated dataset sizes, demonstrating its effectiveness. The shaded regions in the plots represent the standard deviation across five random seeds. (Best viewed in color.)
  • Figure 4: Performance of ImageNet-1k coresets of different sizes on different target models (shown in different colors). Coresets identified with a source model of the same architecture as the target (striped bars) are compared with coresets identified by a different, smaller source model, ResNet-18 (dotted bars). The error bars show the variance of the performance over 5 runs. Accuracy drops only minimally when the smaller source model is used. (Best viewed in color.)
  • Figure 5: Linear datamodeling scores (LDS) of existing TDA metrics compared to $\mathtt{LTC}$. The scores were computed on CIFAR-10 with ResNet-9, using 200 randomly selected query samples evaluated over 100 subsets.
  • ...and 1 more figure