Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes

Wei-Kai Chang, Rajiv Khanna

TL;DR

The paper tackles coreset selection for deep learning, showing that gradient-matching coresets can induce a loss landscape that is misaligned with the full-data landscape under label noise. It introduces a posterior-smoothing framework that perturbs model weights with a Gaussian posterior and uses these samples to guide coreset selection, yielding a smoothed, landscape-aligned objective without explicit Hessian calculations. The authors prove that posterior sampling improves gradient and Hessian alignment and derive convergence guarantees for mini-batch SGD on smoothed coresets, with a convergence rate of $O(1/\sqrt{MRT})$ under certain noise models. Experiments across vision and NLP show faster training, stronger generalization, and robustness to label corruption, with reported speedups of $20\%$ to $200\%$ and gains over prior methods on SNLI, TinyImageNet, ImageNet-1k, CIFAR-100/10, MNIST, and more.

Abstract

As deep learning models continue to scale, growing computational demands have amplified the need for effective coreset selection techniques. Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset. Among various approaches, gradient-based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets. However, these methods face two challenges: naive stochastic gradient descent (SGD) acts as a surprisingly strong baseline, and representativeness breaks down over time due to loss curvature mismatches. In this work, we propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even in high data corruption scenarios. Second, we introduce a smoothed loss function based on posterior sampling over the model weights, enhancing stability and generalization while maintaining computational efficiency. We also present a novel convergence analysis for our sampling-based coreset selection method. Finally, through extensive experiments, we demonstrate that our approach achieves faster training and better generalization than the current state of the art across diverse datasets.
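The smoothed loss mentioned in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation: it assumes an isotropic Gaussian perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ on the weights and a toy linear model with squared loss, and the names `squared_loss`, `smoothed_loss`, `sigma`, and `n_samples` are hypothetical choices made for illustration.

```python
import random


def squared_loss(w, data):
    """Mean squared error of a linear model y ~ w . x over (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(data)


def smoothed_loss(w, data, sigma=0.1, n_samples=64, seed=0):
    """Monte Carlo estimate of E[loss(w + eps)] with eps ~ N(0, sigma^2 I).

    Assumption: an isotropic Gaussian stands in for the weight posterior.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw one perturbed copy of the weights and evaluate the loss there.
        w_perturbed = [wi + rng.gauss(0.0, sigma) for wi in w]
        total += squared_loss(w_perturbed, data)
    return total / n_samples


# Toy data that w = [1, 2] fits exactly, so the unsmoothed loss is zero.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
w = [1.0, 2.0]
plain = squared_loss(w, data)
smooth = smoothed_loss(w, data, sigma=0.1)
```

At an exact minimizer the plain loss is zero while the smoothed loss is strictly positive, because averaging over perturbed weights also reflects the curvature around `w`. That is the intuition behind using weight samples instead of explicit Hessian calculations.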

Paper Structure

This paper contains 25 sections, 3 theorems, 72 equations, 10 figures, 10 tables, 1 algorithm.

Key Result

Theorem 3.2

Suppose a subset $S'\subset S$ is $(\sigma,\epsilon, w)$-stable, and let the Hessian difference be $H_{S',w} - H_{S,w} =: \mathcal{E}$. Then: (1) the Hessian difference matrix $\mathcal{E}$ satisfies […]; (2) the difference between the Newton steps of the two subsets is bounded.

Figures (10)

  • Figure 1:
  • Figure 2:
  • Figure 3:
  • Figure 4:
  • Figure 6: (Left) Gradient match: The average gradient estimation error for our method and the Craig selection method using LeNet on MNIST. The error is calculated as $\left| \frac{1}{|S|} \sum_{i \in S} \nabla l_i(w_t) - \frac{1}{|S'|} \sum_{j \in S'} \gamma_j \nabla l_j(w_t) \right|$, where $S$ is the training set and $S'$ is the selected subset. Our method generally produces smaller gradient errors, and therefore better gradient estimates, than the Craig method. (Right) Memory buffer: The memory consumption on TinyImageNet. Our method uses less memory during training. Crest (Yang et al., 2023) must maintain Hessian information during training, and intermediate calculations, such as checking the Hessian-norm threshold, also require a large memory buffer. For the plot, we apply a running average over time steps and show the mean value.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 3.2
  • Theorem 3.3
  • Theorem A.1
  • proof