Scaling Backwards: Minimal Synthetic Pre-training?

Ryo Nakamura, Ryu Tadokoro, Ryosuke Yamada, Yuki M. Asano, Iro Laina, Christian Rupprecht, Nakamasa Inoue, Rio Yokota, Hirokatsu Kataoka

TL;DR

This work questions the necessity of large-scale real-image pre-training by introducing 1p-frac, a minimal synthetic pre-training dataset derived from a single fractal. It pairs a locally perturbed cross entropy (LPCE) loss with a locally integrated empirical (LIEP) distribution to train a model from just one fractal image, yet achieves downstream performance comparable to ImageNet-1k pre-training in full fine-tuning. Key findings show that the shape perturbations encoded by fractal geometry are crucial for learning strong representations, that appropriate perturbation scales are necessary, and that complex fractal shapes outperform simple ones; strikingly, even a single fractal can outperform larger synthetic datasets under certain conditions. The study also demonstrates that with real images, grayscale contour-like representations plus affine transformations can reproduce the scaling backwards effect, suggesting broad implications for data efficiency, licensing, and ethical considerations in pre-training. Overall, 1p-frac provides a pathway to minimal, fast, and potentially licensing-free pre-training with robust transfer across diverse vision tasks.

Abstract

Pre-training and transfer learning are important building blocks of current computer vision systems. While pre-training is usually performed on large real-world image datasets, in this paper we ask whether this is truly necessary. To this end, we search for a minimal, purely synthetic pre-training dataset that allows us to achieve performance similar to that obtained with the 1 million images of ImageNet-1k. We construct such a dataset from a single fractal with perturbations. With this, we contribute three main findings. (i) We show that pre-training is effective even with minimal synthetic images, with performance on par with large-scale pre-training datasets like ImageNet-1k for full fine-tuning. (ii) We investigate the single parameter with which we construct artificial categories for our dataset. We find that while the shape differences can be indistinguishable to humans, they are crucial for obtaining strong performance. (iii) We investigate the minimal requirements for successful pre-training. Surprisingly, we find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance, a motivation to further investigate "scaling backwards". Finally, we extend our method from synthetic images to real images to see if a single real image can show a similar pre-training effect through shape augmentation. We find that the use of grayscale images and affine transformations allows even real images to "scale backwards".
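
The idea of building artificial categories from a single fractal with perturbations can be illustrated with a small sketch. It assumes the fractal is described by an iterated function system (IFS), as in FractalDB, and that each category is one fixed perturbation of the IFS parameters drawn uniformly from $[-\Delta, \Delta]$. The base parameters, the uniform sampling, and the names `BASE_IFS`, `perturb_ifs`, `make_categories`, and `render` are hypothetical choices for illustration, not the authors' implementation.

```python
import random

# Hypothetical base IFS (illustrative values only). Each row (a, b, c, d, e, f)
# encodes the affine map (x, y) -> (a*x + b*y + e, c*x + d*y + f).
BASE_IFS = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]


def perturb_ifs(ifs, delta, rng):
    """Shift every IFS parameter by uniform noise in [-delta, delta] (assumed scheme)."""
    return [tuple(p + rng.uniform(-delta, delta) for p in row) for row in ifs]


def make_categories(ifs, num_classes, delta, seed=0):
    """One perturbed copy of the base IFS per class; the class label is the list index."""
    rng = random.Random(seed)
    return [perturb_ifs(ifs, delta, rng) for _ in range(num_classes)]


def render(ifs, n_points=20000, size=64, seed=0):
    """Rasterise the IFS attractor into a size x size binary grid via the chaos game."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for i in range(n_points):
        a, b, c, d, e, f = rng.choice(ifs)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= 20:  # discard the first iterations before the orbit settles
            points.append((x, y))
    min_x = min(p[0] for p in points)
    max_x = max(p[0] for p in points)
    min_y = min(p[1] for p in points)
    max_y = max(p[1] for p in points)
    # Use one scale for both axes so the shape is not stretched.
    span = max(max_x - min_x, max_y - min_y) or 1.0
    grid = [[0] * size for _ in range(size)]
    for px, py in points:
        col = min(size - 1, int((px - min_x) / span * (size - 1)))
        row = min(size - 1, int((py - min_y) / span * (size - 1)))
        grid[row][col] = 1
    return grid


# Usage: five categories that differ only by a small parameter perturbation.
categories = make_categories(BASE_IFS, num_classes=5, delta=0.05)
images = [render(ifs) for ifs in categories]
```

With a small `delta`, the rendered grids look nearly identical to a human viewer, yet each one carries a distinct label; a classifier pre-trained on such data has to tell the perturbations apart. Note that this sketch rescales each image to its own bounding box, which is a simplification chosen here.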

Paper Structure (14 sections, 9 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of ImageNet-1k, FractalDB and 1p-frac (ours). 1p-frac consists of only a single fractal for pre-training. With 1p-frac, neural networks learn to classify perturbations applied to the fractal. In our study, “single” means a very narrow distribution over parameters that leads to images that are roughly equivalent from a human visual perspective. While the shape differences of perturbed images can be indistinguishable to humans, models pre-trained on 1p-frac achieve performance comparable to those pre-trained on ImageNet-1k or FractalDB.
  • Figure 2: Scaling backwards from many images to a single synthetic image. (a) Empirical distribution $p_{\text{data}}$. Colors indicate classes. With a single image, the distribution is given by a single Dirac delta function. (b) LIEP distribution $p_{\Delta}$. The support of the distribution narrows as the degree of perturbation $\Delta$ decreases. (c) $\sigma$-factor for investigating fractal shapes. A small $\sigma$ produces complex fractals.
  • Figure 4: Shape augmentation with three geometric transformations from a single real image.
  • Figure 5: The five different images used in the pre-training with a single real image and shape augmentation.