Atlas: Multi-Scale Attention Improves Long Context Image Modeling
Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
TL;DR
Long-context image modeling at high resolutions is computationally challenging, and standard attention permits only limited cross-token interaction at that scale. The authors introduce Multi-Scale Attention (MSA), which maintains $L=\log_S N$ scales for an image of $N$ tokens and uses bi-directional cross-scale attention to fuse information across the entire image in $O(NK\log_S N)$ runtime. Built atop MSA, the Atlas architecture substantially improves the compute-accuracy Pareto frontier on High-Resolution ImageNet-100 (HR-IN100), comparing favorably against FasterViT, MambaVision, ConvNext, Swin, and LongViT at resolutions up to $4096\times4096$. A scale-dropping strategy and QKV caching further improve efficiency, making the approach practical for very large images and for applications in medicine, satellite imagery, and vision-language modeling. Overall, the work provides a scalable primitive for long-context vision with strong empirical gains and broad applicability.
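As a rough illustration of the complexity figures above, the Python sketch below counts scales and compares cost proxies for full attention and MSA. All concrete values are assumptions made for illustration: a $4096\times4096$ image split into hypothetical 16x16 patches, $S=4$, and $K=256$, with $K$ treated as a per-token attention span because the summary does not define it. The function names are hypothetical and are not taken from the Atlas codebase.

```python
def num_scales(num_tokens: int, downsample: int) -> int:
    """Count how many times N tokens can be reduced by a factor S before one remains.

    This equals ceil(log_S N) and uses integer arithmetic to avoid float rounding.
    """
    scales = 0
    remaining = num_tokens
    while remaining > 1:
        remaining = -(-remaining // downsample)  # ceiling division
        scales += 1
    return scales


def full_attention_cost(num_tokens: int) -> int:
    """Proxy for dense attention: every token attends to every token, N^2 pairs."""
    return num_tokens * num_tokens


def msa_cost(num_tokens: int, span: int, downsample: int) -> int:
    """Proxy for the stated O(N K log_S N) bound, ignoring constant factors."""
    return num_tokens * span * num_scales(num_tokens, downsample)


# Assumed setup: 4096px image, 16px patches -> 256 x 256 = 65536 tokens.
N = (4096 // 16) ** 2
S = 4      # assumed downsampling factor between scales
K = 256    # assumed attention span per token

print(num_scales(N, S))                             # 8
print(full_attention_cost(N) // msa_cost(N, K, S))  # 32
```

Under these assumed values the proxy for MSA is 32 times smaller than the dense-attention proxy; the actual speedups reported by the authors depend on implementation details that this count does not capture.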
Abstract
Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates $O(\log N)$ scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling on a high-resolution variant of ImageNet-100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find that Atlas-S achieves 5%, 16%, and 32% higher accuracy at 1024px, 2048px, and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
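The idea of representing an image at $O(\log N)$ progressively coarser scales can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-overlapping average pooling as the downsampling operator (the abstract does not specify one), uses a hypothetical `build_pyramid` helper, and omits the cross-attention that propagates information between scales.

```python
import numpy as np


def build_pyramid(tokens: np.ndarray, pool: int) -> list[np.ndarray]:
    """Return [finest, ..., coarsest] feature grids via repeated average pooling.

    tokens: array of shape (H, W, C) holding one feature vector per token.
    pool:   side length of each pooling block (assumed operator, see lead-in).
    """
    levels = [tokens]
    while levels[-1].shape[0] > 1 and levels[-1].shape[1] > 1:
        h, w, c = levels[-1].shape
        if h % pool or w % pool:
            raise ValueError("grid side must be divisible by the pooling factor")
        # Split the grid into pool x pool blocks and average within each block.
        blocks = levels[-1].reshape(h // pool, pool, w // pool, pool, c)
        levels.append(blocks.mean(axis=(1, 3)))
    return levels


# Toy input: a 64 x 64 token grid with 8 channels (assumed sizes).
grid = np.arange(64 * 64 * 8, dtype=np.float64).reshape(64, 64, 8)
pyramid = build_pyramid(grid, pool=2)
shapes = [level.shape for level in pyramid]
print(shapes)
```

With 2x2 pooling each level has a quarter of the tokens of the previous one, so the 4096-token toy grid yields six coarser levels on top of the original, matching the logarithmic growth in the number of scales.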
