Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Zachary Shinnick; Liangze Jiang; Hemanth Saratchandran; Damien Teney; Anton van den Hengel

TL;DR

This work shows that Vision Transformers (ViTs) can acquire useful, domain-agnostic inductive biases from symbolic, non-visual data during a brief procedural warm-up. The authors generate token sequences from formal grammars and train ViTs on them with frozen embeddings and a masked-token objective, bypassing the visual patch embedding; this provides a training signal that complements standard image-based training. Allocating just 1% of the training budget to procedural data improves final ImageNet-1k accuracy by over 1.7%, an effect equivalent to that of 28% of the ImageNet-1k data. The gains are reported to arise primarily in the attention and deeper MLP layers, and are attributed to the structured hierarchical dependencies in the procedural data. The findings suggest a promising, data-efficient path toward cross-domain pretraining of transformers, potentially applicable beyond vision to other modalities and tasks.

Abstract

Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
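
To make the idea of procedurally generated, non-visual data concrete, the following is a minimal sketch of sampling token sequences from a simple formal grammar and corrupting them for a masked-token objective of the kind mentioned in the TL;DR. The abstract does not say which grammars are used, so the balanced-bracket (Dyck) language here is an assumption chosen for illustration. The function names, the three bracket types, the `<mask>` token, and the 15% masking rate are likewise hypothetical and not taken from the paper.

```python
import random

# Hypothetical vocabulary: three bracket types and a mask token.
# These are illustrative choices, not the paper's actual setup.
OPEN = ["(", "[", "{"]
CLOSE = [")", "]", "}"]
MASK = "<mask>"


def sample_dyck(length, rng):
    """Sample a balanced bracket sequence with exactly `length` tokens."""
    if length <= 0 or length % 2 != 0:
        raise ValueError("length must be a positive even number")
    tokens, stack = [], []
    while len(tokens) < length:
        remaining = length - len(tokens)
        # Closing is forced once every remaining slot is needed
        # to close the brackets that are still open.
        must_close = len(stack) == remaining
        if stack and (must_close or rng.random() < 0.5):
            tokens.append(CLOSE[stack.pop()])
        else:
            kind = rng.randrange(len(OPEN))
            stack.append(kind)
            tokens.append(OPEN[kind])
    return tokens


def is_balanced(tokens):
    """Check that every bracket is closed by a matching bracket, in order."""
    stack = []
    for tok in tokens:
        if tok in OPEN:
            stack.append(OPEN.index(tok))
        elif tok in CLOSE:
            if not stack or stack.pop() != CLOSE.index(tok):
                return False
        else:
            return False
    return not stack


def mask_tokens(tokens, mask_prob, rng):
    """Replace a random subset of tokens with MASK.

    Returns (inputs, targets); targets hold the original token at masked
    positions and None elsewhere.
    """
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


rng = random.Random(0)
sequence = sample_dyck(16, rng)
inputs, targets = mask_tokens(sequence, 0.15, rng)
print(" ".join(sequence))
print(" ".join(inputs))
```

Predicting a masked bracket requires tracking which brackets are still open, which may be at arbitrary distance. This is one plausible example of the hierarchical, content-free structure such a warm-up could expose a transformer to.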
