Path-Guided Flow Matching for Dataset Distillation
Xuhui Li, Zhengquan Luo, Xiwei Liu, Yongqiang Yu, Zhiqiang Xu
TL;DR
Dataset distillation seeks to compress large datasets into small, representative sets without sacrificing downstream performance. This work introduces Path-Guided Flow Matching (PGFM), the first flow-matching-based framework for generative dataset distillation, which operates in the latent space of a frozen VAE and synthesizes samples by solving an ODE in a few steps. PGFM adds lightweight prototype guidance to steer trajectories toward diverse class prototypes, and uses warm-start and trust-region constraints to preserve detail. Across high-resolution benchmarks, PGFM matches or surpasses diffusion-based methods at substantially lower sampling cost and with improved mode coverage, which suggests that flow matching is a practical alternative for scalable dataset distillation.
Abstract
Dataset distillation compresses large datasets into compact synthetic sets on which models can be trained to comparable performance. Despite recent progress, diffusion-based distillation typically depends on heuristic guidance or prototype assignment, which entails time-consuming sampling and trajectory instability and thus hurts downstream generalization, especially under strong control or a low number of images per class (IPC). We propose \emph{Path-Guided Flow Matching (PGFM)}, the first flow-matching-based framework for generative distillation, which enables fast deterministic synthesis by solving an ODE in a few steps. PGFM performs flow matching in the latent space of a frozen VAE to learn class-conditional transport from Gaussian noise to the data distribution. In particular, we develop a continuous path-to-prototype guidance algorithm for ODE-consistent path control, which allows trajectories to land reliably on their assigned prototypes while preserving diversity and efficiency. Extensive experiments on high-resolution benchmarks demonstrate that PGFM matches or surpasses prior diffusion-based distillation approaches with fewer sampling steps and markedly improved efficiency, e.g., 7.6$\times$ more efficient than its diffusion-based counterparts with 78\% mode coverage.
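
To make the idea of few-step, prototype-guided ODE sampling concrete, the following is a minimal NumPy sketch and not the PGFM implementation. Everything in it is an assumption made for illustration: `toy_velocity` is a closed-form stand-in for a trained class-conditional velocity network (it is the velocity of the linear path from N(0, I) to N(mu, s^2 I)), the 2-D vectors stand in for VAE latents, and the guidance rule, which blends the base velocity with the straight-line velocity toward an assigned prototype using a hypothetical weight `guide_weight`, is just one simple way to steer a trajectory. The abstract does not specify the actual guidance, warm-start, or trust-region rules, so none of them are reproduced here.

```python
import numpy as np

def toy_velocity(x, t, mu, s=0.5):
    """Closed-form velocity of the linear path from N(0, I) to N(mu, s^2 I).

    Assumption: this replaces a trained class-conditional network.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    coef = (t * s ** 2 - (1.0 - t)) / var_t
    return mu + coef * (x - t * mu)

def sample(x0, mu, prototype=None, guide_weight=0.0, num_steps=8):
    """Euler integration of the (optionally guided) ODE from t=0 to t=1."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t <= 1 - dt, so (1 - t) below is never zero
        v = toy_velocity(x, t, mu)
        if prototype is not None:
            # Hypothetical guidance: blend with the straight-line velocity
            # that would carry x to the prototype by t = 1.
            to_proto = (prototype - x) / (1.0 - t)
            v = (1.0 - guide_weight) * v + guide_weight * to_proto
        x = x + dt * v
    return x

rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0])          # toy class mean
prototype = np.array([2.5, 0.4])   # toy assigned prototype
x0 = rng.standard_normal((256, 2)) # Gaussian noise at t = 0

plain = sample(x0, mu)
guided = sample(x0, mu, prototype, guide_weight=0.5)

dist_plain = np.linalg.norm(plain - prototype, axis=1).mean()
dist_guided = np.linalg.norm(guided - prototype, axis=1).mean()
print(f"mean distance to prototype: plain={dist_plain:.3f}, guided={dist_guided:.3f}")
```

In this toy setting, eight Euler steps suffice because the velocity field is smooth and sampling is deterministic given `x0`. With `guide_weight` strictly between 0 and 1, the guided samples end closer to the prototype while retaining some spread; setting it to 1 would collapse every sample onto the prototype, which illustrates why guidance strength has to be balanced against diversity.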
