Atlas: Multi-Scale Attention Improves Long Context Image Modeling

Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala

TL;DR

Long-context image modeling at high resolutions is computationally challenging with limited cross-token interaction under standard attention. The authors introduce Multi-Scale Attention (MSA), which maintains $L=\log_S N$ scales and uses bi-directional cross-scale attention to fuse information across the entire image at $O(NK\log_S N)$ runtime. Built atop MSA, the Atlas architecture demonstrates a substantially improved compute-accuracy Pareto frontier on High-Resolution ImageNet-100 (HR-IN100), performing favorably against FasterViT, MambaVision, ConvNext, Swin, and LongViT across resolutions up to $4096\times4096$. A scale-dropping strategy and QKV caching further enhance efficiency, making the approach practical for very large images and applications in medicine, satellite imagery, and vision-language modeling. Overall, the work provides a scalable primitive for long-context vision with strong empirical gains and broad applicability.
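
As a rough illustration of the scaling claim above, the sketch below counts scales as $L=\lceil\log_S N\rceil$ and compares the $NKL$ term with the $N^2$ term of full self-attention. Reading $S$ as the per-scale downsampling factor and $K$ as the number of tokens each query attends to is an assumption, as are the patch size of 16 and the values $S=4$, $K=256$. The function names are hypothetical. Constant factors are ignored, so the results are unitless counts rather than FLOPs.

```python
import math


def num_scales(n_tokens: int, s: int) -> int:
    """Count scales L = ceil(log_S N) using integer arithmetic only."""
    assert s >= 2, "downsampling factor must be at least 2"
    levels, remaining = 0, n_tokens
    while remaining > 1:
        remaining = math.ceil(remaining / s)  # one coarser scale per step
        levels += 1
    return levels


def msa_cost(n_tokens: int, k: int, s: int) -> int:
    """The N * K * log_S N term from the stated runtime (constants dropped)."""
    return n_tokens * k * num_scales(n_tokens, s)


def full_attention_cost(n_tokens: int) -> int:
    """The N^2 term of dense self-attention (constants dropped)."""
    return n_tokens ** 2


# Assumed setup: a 4096x4096 image split into 16x16 patches.
n = (4096 // 16) ** 2                                  # 65536 tokens
print(num_scales(n, 4))                                # 8
print(full_attention_cost(n) / msa_cost(n, 256, 4))    # 32.0
```

Under these assumed values the dense term is 32 times larger than the multi-scale term, and the gap widens as $N$ grows because $N^2$ outpaces $N\log N$.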

Abstract

Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates $O(\log N)$ scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling on a high-resolution variant of ImageNet-100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: (a) Training efficiency comparison of different vision architectures on HR-IN100 across increasing input resolutions (1024-4096px). (b) Atlas exhibits similar runtime scaling as MambaVision while obtaining significantly better accuracy.
  • Figure 2: The Atlas architecture consists of a convolutional stem for initial feature extraction, followed by a series of Multi-Scale Attention (MSA) blocks that progressively downsample the feature maps while preserving global context. This hierarchical design facilitates the effective processing of high-resolution images with efficient communication between features.
  • Figure 3: Illustration of top-down and bottom-up hierarchical communication in Multi-Scale Attention (MSA). The top-down Global Context Aggregation enables coarse-to-fine feature propagation. The bottom-up fine-to-coarse pathway propagates high resolution features into coarser scale representations.
  • Figure 4: Multi-Scale features with iterative summarization.
  • Figure 5: Multi-Scale Attention (MSA) Block
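
The captions for Figures 3 and 4 describe iterative summarization into coarser scales plus fine-to-coarse and coarse-to-fine communication. The NumPy sketch below illustrates only that information flow on a 1D token sequence. It is not the authors' implementation. Average pooling as the summarization operator, single-head attention without learned projections, dense attention between adjacent scales (no windowing, so it does not reproduce the stated runtime), and the bottom-up-then-top-down ordering are all assumptions. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head cross-attention with a residual update (no learned weights)."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return queries + softmax(scores) @ context


def build_scales(tokens, s):
    """Iterative summarization: average-pool groups of s tokens per scale."""
    assert s >= 2, "downsampling factor must be at least 2"
    scales = [tokens]
    while scales[-1].shape[0] > 1:
        cur = scales[-1]
        n, d = cur.shape
        pad = (-n) % s
        if pad:  # repeat the last token so the length divides evenly
            cur = np.concatenate([cur, np.repeat(cur[-1:], pad, axis=0)])
        scales.append(cur.reshape(-1, s, d).mean(axis=1))
    return scales


def bidirectional_pass(scales):
    """Fine-to-coarse, then coarse-to-fine, between adjacent scales."""
    scales = list(scales)  # avoid mutating the caller's list
    for l in range(1, len(scales)):            # bottom-up
        scales[l] = cross_attend(scales[l], scales[l - 1])
    for l in range(len(scales) - 2, -1, -1):   # top-down
        scales[l] = cross_attend(scales[l], scales[l + 1])
    return scales


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))              # 16 tokens, 8 channels
out = bidirectional_pass(build_scales(tokens, s=4))
print([x.shape for x in out])                  # [(16, 8), (4, 8), (1, 8)]
```

After one pass, every finest-scale token has received information that originated anywhere in the sequence, because it was routed through the coarser scales. This is the property the cross-scale design is meant to provide.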