Dataset Meta-Learning from Kernel Ridge-Regression

Timothy Nguyen, Zhourong Chen, Jaehoon Lee

TL;DR

The paper addresses dataset efficiency by learning $\epsilon$-approximate datasets that preserve predictive performance. It introduces Kernel Inducing Points (KIP), a meta-learning method that optimizes a kernel ridge-regression objective to produce a compact support set, with a Label Solve variant. It demonstrates state-of-the-art results on MNIST and CIFAR-10 for kernel methods and neural-network distillation, with compression by one to two orders of magnitude and strong transfer across kernels and to neural nets. It also discusses privacy implications via corruption and potential privacy-preserving data sharing, and provides open-source code.

Abstract

One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state of the art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state of the art results for neural network dataset distillation with potential applications to privacy-preservation.
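
To make the objective concrete, here is a minimal NumPy sketch of the quantity a KIP-style method would minimize: the squared error of kernel ridge-regression predictions on a target set when the regression is fit on a small support set. It also shows a closed-form label solve, where the support inputs are fixed and only the labels are fitted by least squares. Several things here are assumptions rather than the authors' implementation: the RBF kernel stands in for the neural-network kernels the paper uses, the function names (`rbf_kernel`, `support_to_target_map`, `krr_loss`, `label_solve`) and the regularization value are hypothetical, and the gradient-based optimization of the support images is not shown.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Illustrative RBF kernel (an assumption; the paper uses neural-network kernels)."""
    sq_dists = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def support_to_target_map(x_support, x_target, reg):
    """Matrix A such that KRR predictions on x_target equal A @ y_support."""
    k_ss = rbf_kernel(x_support, x_support)
    k_ts = rbf_kernel(x_target, x_support)
    ridge = k_ss + reg * np.eye(len(x_support))
    # ridge is symmetric, so solving against k_ts.T avoids an explicit inverse.
    return np.linalg.solve(ridge, k_ts.T).T


def krr_loss(x_support, y_support, x_target, y_target, reg=1e-3):
    """Squared error on the target set of KRR fit on the support set."""
    a = support_to_target_map(x_support, x_target, reg)
    residual = y_target - a @ y_support
    return 0.5 * float(np.sum(residual**2))


def label_solve(x_support, x_target, y_target, reg=1e-3):
    """Least-squares support labels for fixed support inputs."""
    a = support_to_target_map(x_support, x_target, reg)
    y_support, *_ = np.linalg.lstsq(a, y_target, rcond=None)
    return y_support


# Synthetic stand-in data: 200 target points, 3 classes, 20 support points.
rng = np.random.default_rng(0)
x_target = rng.normal(size=(200, 5))
y_target = np.eye(3)[rng.integers(0, 3, size=200)]  # one-hot labels
x_support, y_support = x_target[:20], y_target[:20]  # plain subset as baseline

loss_subset = krr_loss(x_support, y_support, x_target, y_target)
y_solved = label_solve(x_support, x_target, y_target)
loss_solved = krr_loss(x_support, y_solved, x_target, y_target)
print(loss_subset, loss_solved)
```

Because the predictions are linear in the support labels, the solved labels can never do worse on this loss than the original labels of the same support points. Learning the support inputs as well would require differentiating `krr_loss` with respect to `x_support`, for example with an automatic-differentiation library.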

Paper Structure

This paper contains 19 sections, 3 theorems, 14 equations, 8 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Let $D = (X_t, y_t) \in \mathbb{R}^{n_t \times d} \times \mathbb{R}^{n_t \times C}$ be an arbitrary dataset. Let $w_\lambda \in \mathbb{R}^{d \times C}$ be the coefficients obtained from training $\lambda$ ridge-regression ($\lambda$-RR) on $(X_t, y_t)$, as given by (eq:wRR).
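
The equation referenced as (eq:wRR) is not reproduced in this excerpt. As a hedged illustration of the shapes involved, the sketch below assumes the textbook closed form for ridge regression with $\lambda > 0$, namely $w_\lambda = (X_t^\top X_t + \lambda I)^{-1} X_t^\top y_t$; the paper's exact definition may differ, for instance in how it treats $\lambda = 0$. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def ridge_coefficients(x, y, lam):
    """Primal solution of min_w ||x @ w - y||^2 + lam * ||w||^2, assuming lam > 0."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)


def ridge_coefficients_dual(x, y, lam):
    """Equivalent dual form: kernel ridge-regression with the linear kernel x @ x.T."""
    n = x.shape[0]
    return x.T @ np.linalg.solve(x @ x.T + lam * np.eye(n), y)


# Synthetic data with the shapes used in the theorem statement.
rng = np.random.default_rng(0)
n_t, d, num_classes = 50, 8, 3
x_t = rng.normal(size=(n_t, d))            # X_t in R^{n_t x d}
y_t = rng.normal(size=(n_t, num_classes))  # y_t in R^{n_t x C}

w_primal = ridge_coefficients(x_t, y_t, lam=0.1)
w_dual = ridge_coefficients_dual(x_t, y_t, lam=0.1)
print(w_primal.shape, np.allclose(w_primal, w_dual))
```

The dual form is the link to kernel ridge-regression: replacing `x @ x.T` with a general kernel matrix gives the KRR predictor, while the primal form only exists when the feature map is explicit.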

Figures (8)

  • Figure 1: (a) Learned samples of CIFAR-10 using KIP and its variant $\texttt{KIP}_{\rho}$, for which a $\rho$ fraction of the pixels are uniform noise. Using 1000 such images to train a one-hidden-layer fully connected network results in 49.2% and 45.0% CIFAR-10 test accuracy, respectively, whereas using 1000 original CIFAR-10 images results in 35.4% test accuracy. (b) Example of labels obtained by label solving (LS) (left two) and the covariance matrix between original labels and learned labels (right). Here, 500 labels were distilled from the CIFAR-10 train dataset using the Myrtle 10-layer convolutional network. A test accuracy of 69.7% is achieved using these labels for kernel ridge-regression.
  • Figure 2: LS performance for Myrtle-(5/10) and FC on CIFAR-10/Fashion-MNIST/MNIST. Results computed over 3 independent samples per support set size.
  • Figure 3: KIP-learned images transfer well to finite neural networks. Test accuracy on CIFAR-10 comparing natural images (x-axis) and KIP-learned images (y-axis). Each scatter point corresponds to a different training hyperparameter setting (e.g. learning rate). Top row: clean images; bottom row: 90% corrupted images. KIP images were trained using FC1-3 and Conv1-2 kernels.
  • Figure A1: Studying transfer between kernels.
  • Figure A2: Label Solve transfer between Myrtle-10 and FC for CIFAR-10. Top row: LS labels using Myrtle-10 applied to FC1. Bottom row: LS labels using FC1 applied to Myrtle-10. Results averaged over 3 samples per support set size. In all these plots, NNGP kernels were used and Myrtle-10 used regularized ZCA preprocessing.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • ...and 1 more