Dataset Condensation with Latent Quantile Matching

Wei Wei, Tom De Schepper, Kevin Mets

TL;DR

This work identifies limitations of Maximum Mean Discrepancy (MMD) for distribution matching in dataset condensation (DC) and introduces Latent Quantile Matching (LQM), which leverages the Cramér-von Mises statistic to align quantiles of latent embeddings. By matching optimal quantiles rather than only means, LQM better captures higher-order distributional structure and penalizes outliers, and can be plugged into existing DM-based DC pipelines. Empirical results on image and graph data show that LQM often outperforms MMD-based approaches, with notable gains under tight memory budgets and in continual graph learning (CGL). The findings suggest LQM as a practical, scalable enhancement for DC, improving training efficiency and privacy while maintaining high accuracy.

Abstract

Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However, two distributions with the same mean can still be vastly different. In this work, we demonstrate the shortcomings of using Maximum Mean Discrepancy (MMD) to match latent distributions, i.e., its weak matching power and lack of outlier regularization. To alleviate these shortcomings, we propose a new method, Latent Quantile Matching (LQM), which matches the quantiles of the latent embeddings to minimize a goodness-of-fit test statistic between the two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms the previous state of the art in distribution matching based DC. Moreover, we show that LQM improves performance in the continual graph learning (CGL) setting, where memory efficiency and privacy can be important. Our work sheds light on the application of DM-based DC for CGL.
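
The contrast between matching means and matching quantiles can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the Gaussian stand-in for latent embeddings, and the midpoint probability levels (k - 0.5)/m are all assumptions made for illustration. It shows a synthetic set that matches the real mean almost exactly while still being far from the real distribution in quantile terms.

```python
import numpy as np

def mean_matching_loss(real, syn):
    """Squared distance between per-feature means of two embedding sets."""
    return float(np.sum((real.mean(axis=0) - syn.mean(axis=0)) ** 2))

def quantile_matching_loss(real, syn):
    """Mean squared gap between sorted synthetic values and real quantiles.

    The probability levels (k - 0.5) / m are an assumed, illustrative choice.
    """
    m = syn.shape[0]
    probs = (np.arange(1, m + 1) - 0.5) / m
    target = np.quantile(real, probs, axis=0)  # shape (m, d)
    syn_sorted = np.sort(syn, axis=0)          # per-feature order statistics
    return float(np.mean((syn_sorted - target) ** 2))

rng = np.random.default_rng(0)
# Hypothetical "real" latent embeddings: 5000 samples, 4 features.
real = rng.normal(0.0, 1.0, size=(5000, 4))

# Synthetic set A: same mean as the real data, but almost no spread.
syn_narrow = real.mean(axis=0) + 0.01 * rng.normal(size=(10, 4))
syn_narrow -= syn_narrow.mean(axis=0) - real.mean(axis=0)

# Synthetic set B: placed directly on the real quantiles.
probs = (np.arange(1, 11) - 0.5) / 10
syn_quant = np.quantile(real, probs, axis=0)

print("mean loss, narrow set:     ", mean_matching_loss(real, syn_narrow))
print("quantile loss, narrow set: ", quantile_matching_loss(real, syn_narrow))
print("quantile loss, quantile set:", quantile_matching_loss(real, syn_quant))
```

In this toy setting the mean-based loss cannot tell the collapsed set from the real data, whereas the quantile-based loss is large for the collapsed set and zero for the set placed on the quantiles.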

Paper Structure

This paper contains 14 sections, 8 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: The empirical cumulative distribution function (ECDF) of a latent feature of class 0 in CIFAR-10 after 1900 epochs of training.
  • Figure 2: Distribution Matching + Latent Quantile Matching
  • Figure 3: Comparison of the average Cramér-von Mises statistic between the synthetic dataset and the real dataset for each latent feature distribution. A lower value indicates a higher probability that the compared samples (the synthetic and real datasets in latent space) are drawn from the same distribution. The latent features are extracted by a pretrained model on the synthetic dataset.
  • Figure 4: The average percentage of extreme latent values for each class. A value of one denotes that one percent of the latent features in each class in the synthetic dataset exceed the maximum or fall below the minimum of the corresponding class latent distribution in the real dataset. The latent features are extracted by a pretrained model on the synthetic dataset.
  • Figure 5: Synthetic image dataset learned by IDM+LQM on CIFAR-10 with 10 images per class. Each row corresponds to a class.
  • ...and 3 more figures
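
The quantities described in the captions of Figures 3 and 4 can be sketched in NumPy as follows. This is a hedged illustration, not the paper's evaluation code: the helper names are hypothetical, the Cramér-von Mises statistic uses the standard two-sample form (squared ECDF differences summed over the pooled sample and scaled by NM/(N+M)^2), and the extreme-value percentage is computed for a single class rather than averaged over classes.

```python
import numpy as np

def ecdf(sample, points):
    """Empirical CDF of `sample`, evaluated at `points`."""
    s = np.sort(sample)
    return np.searchsorted(s, points, side="right") / s.size

def cvm_two_sample(x, y):
    """Two-sample Cramér-von Mises statistic for 1-D samples x and y."""
    pooled = np.concatenate([x, y])
    diff = ecdf(x, pooled) - ecdf(y, pooled)
    n, m = x.size, y.size
    return float(n * m / (n + m) ** 2 * np.sum(diff ** 2))

def extreme_percentage(real, syn):
    """Percentage of synthetic latent values outside the real per-feature range.

    `real` and `syn` have shape (num_samples, num_features) for one class.
    """
    lo, hi = real.min(axis=0), real.max(axis=0)
    outside = (syn < lo) | (syn > hi)
    return float(100.0 * outside.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # hypothetical real latent feature
y_same = rng.normal(size=500)            # drawn from the same distribution
y_shift = rng.normal(loc=1.0, size=500)  # drawn from a shifted distribution

print("CvM, same distribution:   ", cvm_two_sample(x, y_same))
print("CvM, shifted distribution:", cvm_two_sample(x, y_shift))

real_toy = np.array([[0.0], [1.0]])
syn_toy = np.array([[-1.0], [0.5], [2.0], [0.7]])
print("extreme values (%):", extreme_percentage(real_toy, syn_toy))
```

A smaller statistic means the two empirical CDFs lie closer together, which matches the reading of Figure 3; in the toy example two of the four synthetic values fall outside the real range.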