Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel

TL;DR

This work shows that Vision Transformers can acquire useful, domain-agnostic inductive biases from symbolic, non-visual data through a brief procedural warm-up. By generating sequences from formal grammars and training ViTs on these tokens with frozen embeddings and a masked-token objective, the authors reveal a training signal that complements standard visual pretraining. Allocating just $1\%$ of the training budget to symbolic data improves ImageNet-1k accuracy by over $1.7\%$ and has the same effect on performance as $28\%$ of the ImageNet-1k data, with gains arising primarily in the attention and deeper MLP layers due to structured hierarchical dependencies. The findings suggest a promising, data-efficient pathway for cross-domain pretraining of transformers, with potential applicability beyond vision to other modalities and tasks.

Abstract

Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.

Paper Structure

This paper contains 51 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: We propose a new pretraining phase for vision transformers using synthetic data that is completely symbolic and non-visual. (Left) We first generate procedural data as sequences of abstract tokens using formal grammars (e.g. sequences of balanced parentheses). We bypass the ViT's visual patch embedding and directly train the rest of the model for standard masked-token prediction. (Right) When followed by standard image-based training (e.g. on ImageNet), our lightweight warm-up consistently improves convergence and downstream performance.
  • Figure 2: Procedural warm-up leads to a distinct and stronger optimization trajectory. When ImageNet-1k pretraining is held fixed, a model initialized with a brief procedural warm-up (in red) follows a training curve clearly distinct from that of a model trained from default initialization (in gray). This suggests that the procedural warm-up provides a qualitatively different training signal rather than merely a head start on standard visual pretraining.
  • Figure 3: Procedural warm-up reduces the ImageNet-1k data requirements. Replacing 1% of the total pretraining budget with 3.8 M procedural samples allows the model to match the accuracy of a full ImageNet-1k pretraining while using 28% fewer natural-image samples (about 108 M fewer images).
  • Figure 4: Downstream accuracy as a function of the number of warm-up steps. The accuracy peaks at an intermediate value. With excessive pretraining, the model likely over-specializes to the procedural data, which hinders adaptation to visual tasks.
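
To make the data described in Figure 1 more concrete, the following is a minimal sketch of how balanced-bracket sequences could be sampled and prepared for masked-token prediction. It is an illustration under stated assumptions, not the authors' implementation: the function names, the two-bracket vocabulary, the 0.5 opening probability, the `<mask>` symbol, and the 25% mask ratio are all hypothetical choices that do not come from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def sample_balanced_parens(n_pairs, rng, brackets=("()", "[]")):
    """Sample a balanced bracket sequence containing n_pairs bracket pairs."""
    seq, stack = [], []
    remaining_opens = n_pairs
    while remaining_opens > 0 or stack:
        # Open a bracket if any remain and either nothing is open
        # or a fair coin flip says so (assumed probability 0.5).
        if remaining_opens > 0 and (not stack or rng.random() < 0.5):
            pair = rng.choice(brackets)
            seq.append(pair[0])
            stack.append(pair[1])  # remember which closer is owed
            remaining_opens -= 1
        else:
            seq.append(stack.pop())  # close the most recent open bracket
    return seq


def is_balanced(seq, brackets=("()", "[]")):
    """Check that every bracket is closed by its matching closer, in order."""
    closer_for = {p[0]: p[1] for p in brackets}
    stack = []
    for tok in seq:
        if tok in closer_for:
            stack.append(closer_for[tok])
        elif not stack or stack.pop() != tok:
            return False
    return not stack


def mask_tokens(seq, mask_ratio, rng):
    """Hide a random subset of positions; return masked inputs and targets."""
    n_mask = max(1, round(mask_ratio * len(seq)))
    positions = set(rng.sample(range(len(seq)), n_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(seq)]
    targets = {i: seq[i] for i in positions}  # position -> original token
    return inputs, targets


rng = random.Random(0)
tokens = sample_balanced_parens(8, rng)           # 16 tokens in total
inputs, targets = mask_tokens(tokens, 0.25, rng)  # assumed 25% mask ratio
print(" ".join(inputs))
```

Recovering a masked bracket requires tracking which opening bracket is still unmatched, which is one plausible reading of the "structured hierarchical dependencies" mentioned in the TL;DR. In the setup the figure describes, token sequences like these would be fed to the transformer directly, bypassing the visual patch embedding.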