Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks

Yuri Kinoshita; Naoki Nishikawa; Taro Toyoizumi

Abstract

Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating the costs of optimization and data storage. However, progress remains largely empirical: the mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical dataset distillation algorithms applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called the multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability at a memory complexity of $\tilde{\Theta}(r^2 d + L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task, respectively. To the best of our knowledge, this is one of the first theoretical works to include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate, and study dataset distillation implemented solely via gradient-based algorithms.
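As a rough illustration of the setting described in the abstract, the sketch below builds a toy multi-index target, i.e. a function of a $d$-dimensional input that depends only on its projection onto an $r$-dimensional subspace, and compares the leading term $r^2 d + L$ with the cost of storing a full training set. It is a minimal sketch under stated assumptions, not the paper's construction: the link function `link`, the width `width` (standing in for $L$), and the sample count `n` are hypothetical choices, and the count ignores the constants and logarithmic factors hidden in $\tilde{\Theta}$.

```python
import math
import random


def orthonormal_basis(d, r, rng):
    """Gram-Schmidt on r Gaussian vectors in R^d; returns r orthonormal rows."""
    basis = []
    for _ in range(r):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:
            dot = sum(a * b for a, b in zip(u, v))
            v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        basis.append([b / norm for b in v])
    return basis


def project(basis, x):
    """Coordinates of x in the r-dimensional subspace spanned by basis."""
    return [sum(a * b for a, b in zip(u, x)) for u in basis]


def link(z):
    # Hypothetical non-linear link on the r = 3 latent coordinates.
    return math.tanh(z[0]) + z[1] * z[2]


def multi_index_target(basis, x):
    """y = link(U x): the label depends on x only through its projection."""
    return link(project(basis, x))


def distilled_memory(d, r, width):
    """Leading term r^2 * d + L, ignoring constants and log factors."""
    return r * r * d + width


rng = random.Random(0)
d, r = 100, 3            # d = 100 and r = 3 match one setting in Figure 3
width, n = 512, 100_000  # hypothetical width L and training set size

U = orthonormal_basis(d, r, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Build a perturbation orthogonal to the subspace spanned by U.
noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
for u, c in zip(U, project(U, noise)):
    noise = [b - c * a for a, b in zip(u, noise)]
x_shift = [a + b for a, b in zip(x, noise)]

# The two labels agree up to floating-point error.
print(multi_index_target(U, x), multi_index_target(U, x_shift))
# Scalars in the leading term versus scalars in n raw (x, y) pairs.
print(distilled_memory(d, r, width), n * (d + 1))
```

Moving the input along any direction orthogonal to the subspace leaves the label unchanged, which is the low-dimensional structure the abstract says is encoded into the distilled data.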

Paper Structure (72 sections, 71 theorems, 308 equations, 3 figures, 5 tables, 5 algorithms)

Introduction
Background
Contributions
Related Works
Organization
Notation
Theoretical Works of DD
Feature Learning with GD
Preliminaries
General Strategy
Distillation Algorithms $\mathcal{M}$
Roles of Batch Initialization and Fixed Reference
Problem Setting and Assumptions
Task Setup
Trained Model
...and 57 more sections

Key Result

Theorem 4.1

Under Assumptions as:target, as:target_H, as:init, as:alg, as:al_init and as:surrogate, we consider $\mathcal{D}_1^S$ with initializations $\{\tilde{x}_m^{(0)},\tilde{y}_m^{(0)}\}_{m=1}^{M_1}$ where $\|\tilde{x}_m^{(0)}\|\sim U(S^{d-1})$ and $\tilde{y}_m^{(0)}$ is some constant. Then, with high pro

Figures (3)

Figure 1: Achieved MSE loss as a function of the training data size with $J^\ast =10^5$ (left) and of the initialization batch size with $N=10^5$ (right). Mean and standard deviation over five seeds.
Figure 2: MSE loss with respect to the training data size $n$ used to fine-tune a model pre-trained with data distilled from an earlier training run on a function with the same principal subspace. $N$ and $J^\ast$ are the parameters of that earlier training run. Mean and standard deviation over five seeds.
Figure 3: Reconstruction percentage when using $\mathcal{D}_2^S$ of different sizes ($1,10,50,100$), constructed following Lemma \ref{lem:rank-certify}, and its compact variant, with respect to the MSE of teacher training $t=2$ (MSE loss) and the maximal attainable rank $L^\ast+1$ (Rank), for $r=3$ (left) and $r=10$ (right). $d$ was set to $100$.

Theorems & Definitions (145)

Definition 2.1
Definition 2.2
Remark 2.3
Remark 2.4
Remark 2.5
Definition 3.2
Definition 3.5
Theorem 4.1: Latent Structure Encoding
Theorem 4.2
Theorem 4.3
...and 135 more
