FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren

TL;DR

FAST tackles the problem of creating distributionally equivalent coresets without relying on neural network proxies. It introduces a graph-theoretic, frequency-domain approach that uses a Phase-Decoupled Characteristic Function Distance (PD-CFD) to overcome phase-gradient issues and a curriculum-based Progressive Discrepancy-Aware Sampling (PDAS) strategy to align low-frequency content first and high-frequency content later, while enforcing topology preservation through spectral graph constraints and DPP-based diversity. The combination yields state-of-the-art performance across diverse vision datasets and scales to edge devices and LLM tuning, while drastically reducing energy consumption and computation time. This work demonstrates that distributional equivalence, preserved via spectral geometry and frequency-domain statistics, can robustly substitute for architecture-specific training signals in coreset selection, with broad practical impact.

Abstract

Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, MMD, CE) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.
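To make the frequency-domain comparison concrete, the sketch below computes an empirical characteristic function for 1-D samples and a squared characteristic function distance between a dataset and two candidate subsets. It is an illustration only, not the FAST implementation: the function names (`ecf`, `cfd_squared`, `split_amplitude_phase`), the frequency grid, and the toy Gaussian data are all assumptions. The amplitude/phase split uses the standard identity |a - b|^2 = (|a| - |b|)^2 + 2|a||b|(1 - cos(Δθ)); because the phase term is scaled by the amplitude product, it shrinks wherever amplitudes are small, which may be related to the "vanishing phase gradient" described above (this link is an assumption, since the abstract does not give the formula).

```python
import cmath
import math
import random

def ecf(samples, t):
    """Empirical characteristic function of 1-D samples at frequency t."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

def cfd_squared(xs, ys, freqs):
    """Mean squared gap between the two empirical characteristic functions."""
    return sum(abs(ecf(xs, t) - ecf(ys, t)) ** 2 for t in freqs) / len(freqs)

def split_amplitude_phase(xs, ys, t):
    """Return (amplitude term, phase term); their sum equals |ecf_x - ecf_y|^2."""
    px, py = ecf(xs, t), ecf(ys, t)
    ax, ay = abs(px), abs(py)
    amp_term = (ax - ay) ** 2
    # The phase term is weighted by ax * ay, so it fades when amplitudes are small.
    phase_term = 2 * ax * ay * (1 - math.cos(cmath.phase(px) - cmath.phase(py)))
    return amp_term, phase_term

# Hypothetical toy data: a "full dataset" and two candidate coresets of size 100.
rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(2000)]
good_subset = full[::20]               # evenly strided subset
biased_subset = sorted(full)[-100:]    # only the 100 largest values

freqs = [0.25 * k for k in range(1, 9)]  # assumed frequency grid
d_good = cfd_squared(full, good_subset, freqs)
d_biased = cfd_squared(full, biased_subset, freqs)
print(d_good < d_biased)
```

In this toy setting the strided subset sits much closer to the full data than the tail-only subset, which is the kind of ranking a distribution-matching selector relies on.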

Paper Structure

This paper contains 13 sections, 16 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: The energy consumption of training exceeds the annual electricity usage of numerous households by several orders of magnitude.
  • Figure 2: Comparison of distribution alignment under different metrics in the frequency domain (on the complex plane). MSE aligns the mean, KL aligns both mean and variance, while CFD captures complete distributional structures in the frequency domain.
  • Figure 3: Low frequencies (b) capture smooth shading and coarse shapes; high frequencies (c) capture edges and fine textures. Amplitude (d) (and corresponding spectrum (e)) encodes the energy distribution across frequencies, while phase (f) specifies the spatial arrangement of structures.
  • Figure 4: (a) Overview of the proposed FAST. Graph-Structure-Aware Constraints (GSAC) preserve topological consistency, while Progressive Discrepancy-Aware Sampling (PDAS) progressively aligns distributions via the phase-decoupled characteristic function distance (PD-CFD). (b) Graph Decoder. Maps the optimized coreset back to the original data space, ensuring structural consistency. (c) Graph Encoder. Constructs the graph topology based on spectral graph theory.
  • Figure 5: Relationship between downstream training accuracy and distributional equivalence. Results indicate that enforcing distributional equivalence leads to improved performance.
  • ...and 6 more figures
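The frequency-domain roles described in the Figure 3 caption (low frequencies for coarse shape, high frequencies for edges, amplitude for energy, phase for spatial arrangement) can be checked on a tiny 1-D signal. The sketch below is a hedged illustration using a naive discrete Fourier transform; the helper names (`dft`, `idft`, `band_split`), the step signal, and the cutoff value are assumptions and do not come from the paper.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def band_split(signal, cutoff):
    """Split a signal into low- and high-frequency parts (bins f and n - f are paired)."""
    n = len(signal)
    spec = dft(signal)
    low = [c if min(f, n - f) <= cutoff else 0 for f, c in enumerate(spec)]
    high = [c - l for c, l in zip(spec, low)]
    return idft(low), idft(high)

step = [0.0] * 16 + [1.0] * 16      # a single sharp edge
shifted = step[-4:] + step[:-4]     # same shape, circularly shifted by 4 samples

low, high = band_split(step, cutoff=2)   # low: smooth ramp, high: edge detail
amp = [abs(c) for c in dft(step)]
amp_shifted = [abs(c) for c in dft(shifted)]

# The two bands add back to the original signal.
print(all(abs(l + h - s) < 1e-9 for l, h, s in zip(low, high, step)))
# Shifting the signal leaves the amplitudes unchanged; only the phases move.
print(all(abs(a - b) < 1e-9 for a, b in zip(amp, amp_shifted)))
```

The second check illustrates why matching amplitudes alone is not enough: two signals with identical amplitude spectra can place their structures in different positions, and that difference is carried entirely by the phase.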