Finding the Muses: Identifying Coresets through Loss Trajectories
Manish Nagaraj, Deepak Ravikumar, Efstathia Soufleri, Kaushik Roy
TL;DR
This work introduces Loss Trajectory Correlation (LTC), a scalable metric that correlates training-sample loss trajectories with validation-loss trajectories to identify influential data for constructing coresets. By leveraging loss dynamics captured during standard training and using Pearson correlation, LTC enables architecture-agnostic coreset selection with low computational and storage overhead. Empirical results on CIFAR-100 and ImageNet-1k show that LTC matches or surpasses state-of-the-art coreset methods across coreset sizes and transfers effectively across diverse architectures (e.g., ResNet, VGG, DenseNet, Swin Transformer) with minimal degradation. In addition to efficient coreset construction, LTC provides insights into training dynamics, distinguishing aligned and conflicting sample behaviors and supporting broader dataset optimization efforts.
Abstract
Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Loss Trajectory Correlation (LTC), a novel metric for coreset selection that identifies the critical training samples driving generalization. LTC quantifies the alignment between the loss trajectories of training samples and those of the validation set, enabling the construction of compact, representative subsets. Unlike traditional methods, whose computational and storage overheads make them infeasible to scale to large datasets, LTC is highly efficient because it can be computed as a byproduct of training. Our results on CIFAR-100 and ImageNet-1k show that LTC consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods, with any differences remaining under 1%. LTC also transfers effectively across architectures, including ResNet, VGG, DenseNet, and Swin Transformer, with minimal performance degradation (<2%). Additionally, LTC offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors, at a fraction of the computational cost of traditional methods. This framework paves the way for scalable coreset selection and efficient dataset optimization.
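As an illustration of the idea, the following minimal sketch scores each training sample by the Pearson correlation between its per-epoch loss trajectory and a validation loss trajectory, then keeps the highest-scoring samples. It is not the authors' implementation. The function names (`pearson`, `ltc_scores`, `select_coreset`), the use of a single mean validation loss per epoch, the use of raw losses rather than epoch-to-epoch changes, and the top-k selection rule are all assumptions made for this example, and the loss values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("trajectories must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant trajectory has zero variance; treat it as uncorrelated (assumption).
    return numerator / denominator if denominator > 0 else 0.0


def ltc_scores(train_losses, val_loss):
    """Score every training sample against the validation trajectory.

    train_losses: one per-epoch loss list per training sample (hypothetical layout).
    val_loss: per-epoch validation loss, assumed here to be the mean over the set.
    """
    return [pearson(trajectory, val_loss) for trajectory in train_losses]


def select_coreset(scores, k):
    """Return the indices of the k highest-scoring samples (assumed selection rule)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy trajectories over four epochs (illustrative values only).
val_loss = [2.0, 1.5, 1.0, 0.8]
train_losses = [
    [2.1, 1.6, 1.1, 0.9],  # falls with the validation loss: aligned
    [0.5, 0.9, 1.2, 1.6],  # rises while the validation loss falls: conflicting
    [1.0, 1.0, 1.0, 1.0],  # flat: carries no correlation signal
]

scores = ltc_scores(train_losses, val_loss)
coreset = select_coreset(scores, k=2)  # keeps samples 0 and 2
```

Because the trajectories are just losses that training already computes, a sketch like this needs only one stored number per sample per epoch, which is consistent with the low storage overhead described above.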
