NARAIM: Native Aspect Ratio Autoregressive Image Models

Daniel Gallo Fernández, Robert van der Klis, Răzvan-Andrei Matişan, Janusz Partyka, Efstratios Gavves, Samuele Papa, Phillip Lippe

TL;DR

NARAIM addresses the observation that vision pre-training has not yet shown language-like scaling by enabling autoregressive pre-training on images without distortive resizing. It preserves the native aspect ratio through an aspect-ratio-aware resize, raster-order patchification, and causal attention, and it explores fractional positional embeddings and regularizing augmentations. The approach improves downstream classification on ImageNet-1k; ablations show benefits from fractional embeddings, random crops, and normalization, and the gains hold across a range of aspect ratios. The work highlights the value of maintaining the original spatial context in autoregressive vision models and points to scalable paths toward stronger, more generalizable representations for non-square imagery.

Abstract

While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: NARAIM approach. The input is divided into patches in row-major order, which are then processed by a vision transformer. The pre-training head utilizes the transformer's output to predict the next token based on the preceding ones. Meanwhile, the classification head, implemented with an attention probe, uses the transformer's output to predict a class.
  • Figure 2: Native aspect ratio resize. Given a crop, it is common to resize it to a fixed-size square. Since the image is going to be patchified and fed to a transformer, and the transformer itself is agnostic to the spatial organization of the patches, we propose keeping the native aspect ratio. First, we resize the image while keeping the aspect ratio fixed, ensuring that the total number of pixels does not exceed $224^2$. Then, we patchify the image, obtaining at most 256 patches.
  • Figure 3: The classification accuracy over image aspect ratios. NARAIM improves across all aspect ratios.
  • Figure 4: Aspect ratio distribution. This histogram shows the aspect ratios of the ImageNet-1k validation set. Most of the images are in landscape orientation, but we can observe three modes corresponding to portrait, square, and landscape images.
  • Figure 5: Prefix causal attention. During pre-training (left), we uniformly sample a prefix length $n$ (e.g., $n = 3$). Attention among the first $n$ patches is bidirectional, and no loss is computed for them. The remaining patches use a causal mask, and their loss is computed. During fine-tuning on a downstream task (right), the mask is discarded. The gray patches represent padding, which is added for reasons explained in Section \ref{sec:method}.
  • ...and 1 more figures
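
The resize in Figure 2 and the attention mask in Figure 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The patch size of 14 is an assumption inferred from the captions ($224^2$ pixels and at most 256 patches imply $14 \times 14$ patches), and the function names `native_aspect_ratio_grid`, `raster_order`, `prefix_causal_mask`, and `loss_mask` are hypothetical. Padding (the gray patches in Figure 5) is not handled here.

```python
import math

PATCH_SIZE = 14                              # assumption: 224 / 14 = 16 patches per side
MAX_PIXELS = 224 ** 2                        # pixel budget from the Figure 2 caption
MAX_PATCHES = MAX_PIXELS // PATCH_SIZE ** 2  # 256


def native_aspect_ratio_grid(height, width):
    """Return (rows, cols) of a patch grid that roughly keeps height:width
    while staying within MAX_PATCHES patches."""
    # Scale both sides by the same factor so the area matches the pixel budget.
    scale = math.sqrt(MAX_PIXELS / (height * width))
    rows = max(1, math.floor(height * scale / PATCH_SIZE))
    cols = max(1, math.floor(width * scale / PATCH_SIZE))
    # Guard for extreme aspect ratios, where clamping to 1 could break the budget.
    rows = min(rows, MAX_PATCHES)
    cols = min(cols, MAX_PATCHES // rows)
    return rows, cols


def raster_order(rows, cols):
    """Patch coordinates in row-major (raster) order, as in Figure 1."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def prefix_causal_mask(num_patches, prefix_len):
    """mask[i][j] is True when patch i may attend to patch j.
    Patches inside the prefix see each other; later patches are causal."""
    return [
        [j <= i or j < prefix_len for j in range(num_patches)]
        for i in range(num_patches)
    ]


def loss_mask(num_patches, prefix_len):
    """True for patches whose loss is computed (those outside the prefix)."""
    return [i >= prefix_len for i in range(num_patches)]


if __name__ == "__main__":
    rows, cols = native_aspect_ratio_grid(375, 500)  # a 4:3 landscape image
    print(rows, cols, rows * cols)                   # 13 18 234
    for row in prefix_causal_mask(5, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

For a 375 x 500 landscape image, this sketch yields a 13 x 18 grid (234 patches), which corresponds to a resized image of 182 x 252 pixels, whereas a square resize would always give 16 x 16. Flooring to whole patches is one possible rounding choice; the paper may round differently.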