
Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

Vyacheslav Kungurtsev, Yuanfang Peng, Jianyang Gu, Saeed Vahidian, Anthony Quinn, Fadwa Idlahcen, Yiran Chen

TL;DR

This work presents a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest, and reveals novel applications of DD across different modeling environments.

Abstract

Dataset distillation (DD) is an increasingly important technique that constructs a synthetic dataset capturing the core information in the training data, so that models trained on the synthetic dataset achieve performance comparable to models trained on the original data. While DD has a wide range of applications, the theory supporting it is less well developed. New DD methods are compared on a common set of benchmarks, rather than oriented towards any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We analyze existing DD methods through this broader lens, highlighting their strengths and limitations in terms of accuracy and faithfulness to optimal DD operation. Finally, we present numerical results for two case studies important in contemporary settings. Firstly, we address a critical challenge in medical data analysis: merging the knowledge from different datasets composed of intersecting, but not identical, sets of features, in order to construct a larger dataset in what is usually a small-sample setting. Secondly, we consider out-of-distribution error across boundary conditions for physics-informed neural networks (PINNs), showing the potential for DD to provide more physically faithful data. By establishing this general formulation of DD, we aim to provide a new research paradigm by which DD can be understood and from which new DD techniques can arise.

Paper Structure

This paper contains 31 sections, 49 equations, 7 figures, 1 table, 2 algorithms.

Figures (7)

  • Figure 1: (a) We generate datasets by sampling from a randomly initialized DBN. To mimic the heterogeneity of medical data, we split the sampled clean data into $K$ partitions. For each partition, we hide the values of different variables and add Gaussian noise to each entry. (b) To distill the synthetic data more effectively, we initialize $D^{S}$ from each source of training data. During each epoch, we sample a part $D_{j}^{S}$ from $D^{S}$, add Gaussian noise $\delta$ to it, and learn two different DBNs from them respectively. We then retrieve the training data $\left\{ D_i^T \mid i \ne j \right\}$. Finally, we compute the log-likelihood of $D_{j}^{S}$ under the two different DBNs as the evaluation score, from which we calculate the gradient with respect to $D_{j}^{S}$.
  • Figure 2: Performance of our algorithm in 4 different IPC settings. Across all 4 settings, our algorithm performs much better than the fully-observed clean data $D^{sub}$, especially in the low-data regime. When IPC=100, our algorithm is almost as effective as the complete and clean training dataset.
  • Figure 3: Log-likelihood (LL) curves on the test set $D^{Test}$ for 4 different IPC settings. We evaluate the synthetic dataset $\hat{D}$ after each distillation process and report the LL, compared against $D^{sub}$.
  • Figure 4: Performance of our algorithm when scaling up the number of time-slices from 2 to 10. Across all 4 IPC settings, our algorithm performs much better than the baseline.
  • Figure 5: Training and test dataset generation for the PINN example. To demonstrate the out-of-distribution generalization ability of our algorithm, the boundary conditions ($\alpha$) of the training and test datasets are sampled from two opposite tails of a normal distribution. For the training dataset, noise is added to the ground truth $y_i$ to mimic measurement error, while for the test dataset, no noise is added to $y_i$.
  • ...and 2 more figures
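
The data-generation step described in the Figure 1(a) caption (split the clean data into $K$ partitions, hide different variables in each, and add Gaussian noise to each entry) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_partitions`, the random choice of which variables to hide, the number of hidden variables, the noise scale, and the Gaussian stand-in for DBN samples are all assumptions made only for illustration.

```python
import random


def make_partitions(clean_rows, k, n_hidden, noise_std, seed=0):
    """Split rows into k partitions; hide n_hidden variables per partition
    and add Gaussian noise to every remaining entry.

    Hypothetical helper: the hiding rule and parameters are assumptions,
    not details taken from the paper.
    """
    rng = random.Random(seed)
    n_vars = len(clean_rows[0])
    rows = list(clean_rows)
    rng.shuffle(rows)

    partitions = []
    for p in range(k):
        part_rows = rows[p::k]  # round-robin split into k disjoint parts
        # Assumption: each partition hides its own random subset of variables.
        hidden = set(rng.sample(range(n_vars), n_hidden))
        noisy = [
            [None if j in hidden else x + rng.gauss(0.0, noise_std)
             for j, x in enumerate(row)]
            for row in part_rows
        ]
        partitions.append({"hidden": sorted(hidden), "rows": noisy})
    return partitions


# Stand-in for clean samples; the paper samples these from a DBN instead.
_rng = random.Random(1)
clean = [[_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(12)]
parts = make_partitions(clean, k=3, n_hidden=2, noise_std=0.1)
```

Each returned partition records which variable indices were hidden, so partitions with intersecting but not identical feature sets can be inspected directly; hidden entries are represented as `None`.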