Neighboring Autoregressive Modeling for Efficient Visual Generation

Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

TL;DR

Neighboring Autoregressive Modeling (NAR) reframes visual generation as a near-to-far outpainting task that preserves spatial-temporal locality and enables parallel decoding along orthogonal dimensions. By introducing dimension-oriented decoding heads and proximity-aware masks, NAR decouples token distributions across dimensions, dramatically reducing forward passes while improving image, video, and text-to-image quality. Empirical results show substantial throughput gains (up to ~13×) and competitive or superior FID/FVD/GenEval performance with relatively small training data, across ImageNet, UCF-101, and GenEval benchmarks. The work demonstrates practical efficiency gains and highlights data-efficient text-to-image generation, with potential for scaling to larger tokenizers and video datasets in future research.

Abstract

Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens than with distant ones. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps needed for generation. Experiments on ImageNet $256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluated on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
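
The near-to-far decoding order described in the abstract can be illustrated with a small sketch. It is not taken from the NAR codebase: the function name `nar_schedule` is hypothetical, and it assumes the initial token sits at the origin of the grid (top-left corner, first frame), which the abstract does not specify. It only groups token positions by Manhattan distance to show how many parallel steps such a schedule would need.

```python
from collections import defaultdict
from itertools import product


def nar_schedule(shape):
    """Group grid positions by Manhattan distance from the origin.

    `shape` is (H, W) for an image or (T, H, W) for a video token grid.
    Assumption: the initial token is at the origin; the paper may differ.
    Returns a list where entry d holds all positions decoded at step d.
    """
    steps = defaultdict(list)
    for coord in product(*(range(n) for n in shape)):
        # With the start at the origin, the Manhattan distance is the
        # sum of the coordinates.
        steps[sum(coord)].append(coord)
    return [steps[d] for d in range(len(steps))]


# Hypothetical 16x16 token grid: one step per distance value.
schedule = nar_schedule((16, 16))
num_parallel_steps = len(schedule)               # H + W - 1
num_raster_steps = sum(len(s) for s in schedule)  # H * W, one token per step
```

Under these assumptions, a grid of size $H \times W$ needs $H + W - 1$ steps instead of $H \cdot W$, and a $T \times H \times W$ video grid needs $T + H + W - 2$. Measured throughput gains depend on more than the step count, so these figures are not a substitute for the reported results.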

Paper Structure

This paper contains 18 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Generated samples from NAR. Results are shown for $512\times512$ text-guided image generation (1st row), $256\times256$ class-conditional image generation (2nd row) and $128\times128$ class-conditional video generation (3rd row).
  • Figure 2: Generation quality and efficiency comparisons between various visual generation methods. Data is collected from ImageNet $256\times256$ dataset over models with parameters around 300M.
  • Figure 3: Comparison of different autoregressive visual generation paradigms. The proposed NAR paradigm formulates the generation process as an outpainting procedure, progressively expanding the boundary of the decoded token region. This approach effectively preserves locality, as all tokens near the starting point are consistently decoded before the current token.
  • Figure 4: Illustration of the dimension-oriented decoding heads. The horizontal head and the vertical head are responsible for predicting the next token in the row and column dimensions, respectively. Here, $L$ is the number of Transformer blocks in the backbone network.
  • Figure 5: Proximity-aware attention masks for the NAR paradigm. "S$n$" denotes the $n$-th generation step. Tokens generated within the same step are represented by the same color. To maintain the autoregressive property, a causal mask is applied between tokens across different generation steps (aligned with Figure 3). Within each step, bidirectional attention is employed among the tokens to enhance consistency during parallel generation.
  • ...and 6 more figures
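
The masking rule in the Figure 5 caption (causal across generation steps, bidirectional within a step) can be sketched as follows. This is a simplified reading of the caption rather than the authors' implementation: the function name `proximity_mask` is hypothetical, tokens are indexed in raster order, and the generation step of a token is assumed to be its Manhattan distance from the top-left corner.

```python
def proximity_mask(height, width):
    """Build a boolean attention mask for a height x width token grid.

    mask[q][k] is True when query token q may attend to key token k.
    Assumptions: raster-order indexing, and step = row + column
    (Manhattan distance from the top-left token).
    """
    steps = [r + c for r in range(height) for c in range(width)]
    # A token sees every token from an earlier step (causal across steps)
    # and every token from its own step (bidirectional within a step).
    return [[k_step <= q_step for k_step in steps] for q_step in steps]


# 2x2 grid: steps are [0, 1, 1, 2] in raster order.
mask = proximity_mask(2, 2)
```

In the 2x2 case, the two tokens at step 1 can attend to each other and to the initial token, but not to the step-2 token. How the actual model aligns input and output positions under this mask is not stated in the caption and is left out here.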