Scaling Down Deep Learning with MNIST-1D

Sam Greydanus, Dmitry Kobak

TL;DR

MNIST-1D offers a $40$-dimensional, procedurally generated toy dataset with default $4000$ training and $1000$ test samples to study core deep-learning phenomena on modest hardware. The paper demonstrates that MNIST-1D differentiates model inductive biases (CNNs/GRUs outperforming MLPs), enables rapid research on lottery tickets, deep double descent, gradient-based meta-learning, activation-function meta-learning, and self-supervised learning, and allows analysis of pooling effects, all within minutes to an hour of computation. By emphasizing controlled, small-scale experiments, it advocates a scaling-down manifesto to improve interpretability, reproducibility, and environmental sustainability while informing when and how to scale up. Overall, MNIST-1D serves as a practical, high-signal testbed for causal analysis and fast prototyping that complements large-scale investigations.

Abstract

Although deep learning models have taken on commercial and political relevance, key aspects of their training and operation remain poorly understood. This has sparked interest in science of deep learning projects, many of which require large amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D: a minimalist, procedurally generated, low-memory, and low-compute alternative to classic deep learning benchmarks. Although the dimensionality of MNIST-1D is only 40 and its default training set size only 4000, MNIST-1D can be used to study inductive biases of different deep architectures, find lottery tickets, observe deep double descent, metalearn an activation function, and demonstrate guillotine regularization in self-supervised learning. All these experiments can be conducted on a GPU or often even on a CPU within minutes, allowing for fast prototyping, educational use cases, and cutting-edge research on a low budget.
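To make "procedurally generated" concrete, here is a minimal sketch of a pad, translate, add-noise, and resample pipeline that yields 40-point sequences in a 4000/1000 split. It is not the authors' generator. The sine-wave `TEMPLATES`, the helper names (`resample`, `make_sample`, `make_dataset`), and the `max_pad` and `noise_scale` values are all assumptions made for illustration, and the real dataset starts from hand-crafted digit templates.

```python
import math
import random

SEQ_LEN = 40  # final sequence length stated in the abstract

# Placeholder templates (hypothetical): one short smooth curve per class.
# The real dataset uses hand-crafted digit templates instead.
TEMPLATES = {
    d: [math.sin((d + 1) * math.pi * t / 11) for t in range(12)]
    for d in range(10)
}


def resample(seq, n):
    """Linearly interpolate seq onto n evenly spaced points."""
    out = []
    for i in range(n):
        pos = i * (len(seq) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def make_sample(template, rng, max_pad=24, noise_scale=0.25):
    """Pad, translate, add noise, then resample to SEQ_LEN points."""
    # Random padding: append a random number of zeros (assumed scheme).
    padded = list(template) + [0.0] * rng.randint(0, max_pad)
    # Random translation: circular shift by a random offset.
    shift = rng.randrange(len(padded))
    shifted = padded[shift:] + padded[:shift]
    # Additive i.i.d. Gaussian noise (assumed noise model).
    noisy = [v + rng.gauss(0.0, noise_scale) for v in shifted]
    return resample(noisy, SEQ_LEN)


def make_dataset(n_samples, seed=0):
    """Return (sequences, labels) with labels drawn uniformly from 0..9."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_samples):
        label = rng.randrange(10)
        xs.append(make_sample(TEMPLATES[label], rng))
        ys.append(label)
    return xs, ys


# Default split sizes quoted in the abstract and TL;DR.
x_train, y_train = make_dataset(4000, seed=0)
x_test, y_test = make_dataset(1000, seed=1)
```

Because the data come from a seeded generator rather than a fixed file, the number of samples and the noise level can be varied per experiment while staying reproducible.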

Paper Structure

This paper contains 18 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Constructing the MNIST-1D dataset. Unlike MNIST, each sample is a one-dimensional sequence. To generate each sample, we begin with a hand-crafted digit template loosely inspired by MNIST shapes. Then we randomly pad, translate, and add noise to produce 1D sequences with 40 points each. https://github.com/greydanus/mnist1d/blob/master/notebooks/building-mnist1d.ipynb
  • Figure 2: Train and test accuracy of common classification models on MNIST-1D. The logistic regression model fares worse than the MLP. Meanwhile, the MLP fares worse than the CNN and GRU, which use translation invariance and local connectivity to bias optimization towards solutions that generalize well. When local spatial correlations are destroyed by shuffling feature indices (dashed lines), the MLP performs the best. CPU runtime: $\sim$10 minutes. https://github.com/greydanus/mnist1d/blob/master/notebooks/mnist1d-classification.ipynb
  • Figure 3: Visualizing the MNIST and MNIST-1D datasets with $t$-SNE. The well-defined clusters in the MNIST embedding indicate that the classes are separable via a simple $k$NN classifier in pixel space. The MNIST-1D plot reveals little structure and a lack of clusters, indicating that nearest neighbors in pixel space are not semantically meaningful, as is the case with natural image datasets. https://github.com/greydanus/mnist1d/blob/master/notebooks/tsne-mnist-vs-mnist1d.ipynb
  • Figure 4: Finding and analyzing lottery tickets. (a--b) The test loss and test accuracy of lottery tickets at different levels of sparsity, compared to randomly selected subnetworks and to the original dense network. (c) Performance of lottery tickets with 92% sparsity. (d) Performance of the same lottery tickets when trained on flipped data. (e) Performance of the same lottery tickets when trained on data with shuffled features. (f) Performance of the same lottery tickets but with randomly initialized weights, when trained on original data. (g) Lottery tickets had more adjacent non-zero weights in the first layer compared to random subnetworks. Runtime: $\sim$30 minutes. https://github.com/greydanus/mnist1d/blob/master/notebooks/lottery-tickets.ipynb
  • Figure 5: Deep double descent in MNIST-1D classification. Here the test set had $12\,000$ samples. (a) MLP classifier with one hidden layer. (b) MLP classifier; 15% label noise. (c) CNN classifier with three convolutional layers; 15% label noise. Adapted with permission from Prince (2023). CPU runtime: $\sim$60 minutes. https://github.com/greydanus/mnist1d/blob/master/notebooks/deep-double-descent.ipynb
  • ...and 6 more figures
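The shuffled-feature control in the Figure 2 caption can be sketched as follows. This sketch assumes that "shuffling feature indices" means drawing one fixed permutation of the 40 positions and applying it to every sample. The caption does not say so explicitly, and the function names (`make_feature_permutation`, `shuffle_features`, `roughness`) are hypothetical. Each sample keeps exactly the same values, but neighbouring positions no longer carry related information, which removes the advantage of models that rely on locality.

```python
import random


def make_feature_permutation(n_features, seed=0):
    """Draw one fixed permutation of feature indices (assumed to be shared
    by all training and test samples)."""
    rng = random.Random(seed)
    perm = list(range(n_features))
    rng.shuffle(perm)
    return perm


def shuffle_features(samples, perm):
    """Reorder the features of every sample with the same permutation."""
    return [[row[j] for j in perm] for row in samples]


def roughness(row):
    """Mean absolute difference between neighbouring features."""
    return sum(abs(a - b) for a, b in zip(row, row[1:])) / (len(row) - 1)


# Toy input: a smooth ramp and its mirror image, each with 40 features.
ramp = [float(i) for i in range(40)]
samples = [ramp, ramp[::-1]]

perm = make_feature_permutation(40, seed=0)
shuffled = shuffle_features(samples, perm)

print(roughness(samples[0]), roughness(shuffled[0]))
```

A fully connected layer is indifferent to a fixed reordering of its inputs, whereas convolutional and recurrent layers depend on neighbouring positions being related. This is consistent with the caption's observation that the MLP performs best once features are shuffled.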