NARAIM: Native Aspect Ratio Autoregressive Image Models
Daniel Gallo Fernández, Robert van der Klis, Răzvan-Andrei Matişan, Janusz Partyka, Efstratios Gavves, Samuele Papa, Phillip Lippe
TL;DR
NARAIM addresses the lack of language-model-like scaling in vision pre-training by enabling autoregressive pre-training on images without distortive resizing. It preserves the native aspect ratio through an aspect-ratio-aware resize, raster-order patchification, and causal attention, and it explores fractional positional embeddings and regularizing augmentations. The approach improves downstream classification on ImageNet-1k: ablations show benefits from fractional embeddings, random crops, and normalization, and performance remains robust across a range of aspect ratios. The work highlights the value of maintaining the original spatial context in autoregressive vision models and points toward scalable, more generalizable representations for non-square imagery.
Abstract
While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.
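To make the idea of native-aspect-ratio autoregressive inputs concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a fixed patch budget, a nearest-neighbour resize, and positions normalized to [0, 1] as one possible reading of "fractional" positional embeddings. The function names, the patch size of 14, and the budget of 256 patches are all illustrative assumptions.

```python
import math
import numpy as np

def native_grid(height, width, max_patches=256):
    """Choose a (rows, cols) patch grid that roughly keeps height/width
    while staying within a patch budget (assumed scheme)."""
    aspect = height / width
    rows = min(max(1, math.floor(math.sqrt(max_patches * aspect))), max_patches)
    cols = min(max(1, math.floor(math.sqrt(max_patches / aspect))), max_patches // rows)
    return rows, cols

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of an (H, W, C) array (stand-in for a real resize)."""
    h, w = img.shape[:2]
    ys = (np.arange(new_h) * h / new_h).astype(int)
    xs = (np.arange(new_w) * w / new_w).astype(int)
    return img[ys][:, xs]

def patchify(img, patch):
    """Split an (H, W, C) array into flattened patches in raster order
    (left to right, then top to bottom)."""
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    x = img[: rows * patch, : cols * patch]
    x = x.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)

def fractional_positions(rows, cols):
    """Patch-centre coordinates normalized to [0, 1], in raster order (assumed form)."""
    r = (np.arange(rows) + 0.5) / rows
    c = (np.arange(cols) + 0.5) / cols
    return np.stack(np.meshgrid(r, c, indexing="ij"), axis=-1).reshape(-1, 2)

def causal_mask(n):
    """Boolean mask where entry (i, j) is True if token i may attend to token j."""
    return np.tril(np.ones((n, n), dtype=bool))

def autoregressive_pairs(img, patch=14, max_patches=256):
    """Build next-patch prediction inputs and targets for one image."""
    rows, cols = native_grid(img.shape[0], img.shape[1], max_patches)
    patches = patchify(resize_nearest(img, rows * patch, cols * patch), patch)
    pos = fractional_positions(rows, cols)
    # Patch t is predicted from patches 0..t-1, so inputs and targets are offset by one.
    return patches[:-1], patches[1:], pos[:-1], causal_mask(len(patches) - 1)
```

Under these assumptions, a 480×640 image maps to a 13×18 grid (234 patches) rather than a square grid, so the token sequence keeps the image's landscape layout.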
