Sample- and Parameter-Efficient Auto-Regressive Image Models

Elad Amrani, Leonid Karlinsky, Alex Bronstein

TL;DR

XTRA introduces Block Causal Masking to auto-regressive image modeling, enabling block-wise attention and next-block reconstruction within a ViT encoder-decoder. The method achieves remarkable sample and parameter efficiency, requiring $152\times$ fewer samples and $7$–$16\times$ fewer parameters than prior AR image models while delivering superior or state-of-the-art performance on 15 diverse benchmarks and ImageNet-1K probing. These gains arise from learning low-frequency, semantically meaningful structures at block scales, rather than focusing on high-frequency detail, and from maintaining a simple, scalable training objective. The work demonstrates strong practical potential for scalable autoregressive CV models, with broad transfer capabilities and reduced resource requirements for pre-training and probing tasks.

Abstract

We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents $k \times k$ tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on $152\times$ fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using $7$–$16\times$ fewer parameters (85M vs. 1.36B/0.63B).
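
As a quick sanity check, the efficiency ratios quoted in the abstract follow directly from the sample and parameter counts it gives. The arithmetic below uses only those figures; the rounding is ours.

$$
\frac{2 \times 10^{9}}{13.1 \times 10^{6}} \approx 152.7, \qquad
\frac{1.36 \times 10^{9}}{85 \times 10^{6}} = 16, \qquad
\frac{0.63 \times 10^{9}}{85 \times 10^{6}} \approx 7.4
$$

The first ratio corresponds to the reported $152\times$ reduction in training samples, and the last two bracket the reported $7$–$16\times$ reduction in parameters.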

Paper Structure

This paper contains 29 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: XTRA Architecture. Following ViT (Dosovitskiy et al., 2021), an image is partitioned into a sequence of patches (numbered grid) and processed by a standard ViT encoder-decoder architecture with our proposed Block Causal Masking. That is, causality is enforced at the block level with a rasterized pattern (see Figure 2 for a detailed explanation). In the example above, a block represents 2$\times$2 patches/tokens. The token representations within each block at the output of the decoder are concatenated in a predetermined order such that each block of pixels is represented by a single embedding vector. Finally, each block embedding vector is passed through an MLP (the same MLP for all blocks) to predict all pixel values of the next block in the sequence.
  • Figure 2: Block Causal Masking. In the image above, the fine-grained grid represents a grid of patches (following ViT; Dosovitskiy et al., 2021) that are processed by the model. The coarse-grained (numbered) grid represents a grid of blocks, where each block represents 4$\times$4 patches/tokens. Block Causal Masking enforces causality at the block level with a rasterized pattern (numbered sequence), ensuring that tokens can attend to others within the same block and also to preceding blocks.
  • Figure 3: Visualization of XTRA's Predictions on the ImageNet-1k Validation Set. XTRA generates predictions auto-regressively, producing one block of pixels at a time, with each new block conditioned on the preceding sequence of ground-truth blocks. Note that no loss is applied to the upper-left block (first in the sequence), yet its color differs from image to image. This is because the loss is applied to normalized block pixels, and each block is de-normalized separately after generation.
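
The Block Causal Masking rule described in the Figure 2 caption (a token may attend to tokens in its own block and in preceding blocks, with blocks ordered in raster fashion) can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `block_causal_mask`, its parameters, and the assumption that patch tokens are stored in row-major order over the patch grid are all ours, chosen only for illustration.

```python
import numpy as np


def block_causal_mask(grid_h, grid_w, k):
    """Return (allowed, block_id) for a grid_h x grid_w patch grid.

    Hypothetical helper, not from the paper. Assumes patch tokens are
    laid out in row-major order and that k divides both grid sides.
    allowed[i, j] is True when token i may attend to token j.
    """
    if grid_h % k or grid_w % k:
        raise ValueError("block size k must divide both grid dimensions")

    # Row and column of every patch token in row-major order.
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)

    # Raster-order index of the k x k block that contains each token.
    block_id = (rows // k) * (grid_w // k) + (cols // k)

    # Token i may attend to token j if j's block is the same as,
    # or earlier than, i's block.
    allowed = block_id[None, :] <= block_id[:, None]
    return allowed, block_id


# Example: a 4 x 4 patch grid with 2 x 2 blocks, as in the Figure 1 caption.
mask, block_id = block_causal_mask(4, 4, 2)
print(block_id.reshape(4, 4))
print(mask.astype(int))
```

With `k = 1` this reduces to the standard token-level causal mask, and with `k` equal to the full grid side every token can attend to every other token. A boolean mask of this shape is typically converted to additive form (zero where allowed, negative infinity elsewhere) before being added to the attention logits.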