Autoregressive Models in Vision: A Survey

Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong

TL;DR

The paper surveys autoregressive (AR) models in vision, arguing that visual data can be effectively modeled as pixel-, token-, or scale-based sequences. It systematically maps architectures, tokenizer designs, and generation tasks across image, video, 3D, and multimodal domains, and compares AR methods to VAEs, GANs, and diffusion approaches. Key contributions include a structured taxonomy, a discussion of computational tradeoffs, and a forward-looking set of challenges and application roadmaps, supported by a broad bibliography and a public code repository. The work outlines how AR vision models scale, where they excel (quality, diversity, multimodal integration), and where diffusion-based methods still outperform them, pointing to hybrid strategies and large-scale multimodal AR systems as promising directions.

Abstract

Autoregressive modeling has been highly successful in natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. In computer vision, however, the representation strategy can vary across levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories according to their representation strategy: pixel-based, token-based, and scale-based models. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges facing autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.

Paper Structure

This paper contains 80 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: A timeline of representative visual autoregressive models, illustrating their rapid evolution from early pixel-based approaches such as PixelRNN in 2016 to a variety of recent advanced systems. The field continues to grow rapidly.
  • Figure 2: Literature taxonomy of autoregressive models in vision.
  • Figure 3: Illustration of the three general frameworks of visual autoregressive models, distinguished by their sequence representation strategies. Next-Pixel Prediction flattens the image into a pixel sequence. Next-Token Prediction converts the image into a token sequence via a visual tokenizer. Next-Scale Prediction employs a multi-scale tokenizer to generate a multi-scale sequence.
  • Figure 4: Core components of visual autoregressive models. (a) Sequence Representation encodes visual data into a discrete visual sequence, followed by reconstruction. (b) Autoregressive Sequence Modeling predicts each element sequentially.
  • Figure 5: Taxonomy of visual tokenizers in unconditional generation.
  • ...and 5 more figures
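
The three sequence representations named in the Figure 3 caption can be illustrated with a toy sketch. This is not the survey's code or any particular model's tokenizer: the function names, the patch sizes, and the "quantize the mean patch intensity" rule are all hypothetical stand-ins for a learned visual tokenizer, chosen only to show how the sequence an autoregressive model would consume changes shape under each strategy.

```python
import numpy as np

def pixel_sequence(image):
    """Next-pixel view: flatten an (H, W, C) image in raster order."""
    return image.reshape(-1)

def token_sequence(image, patch=4, codebook_size=16):
    """Next-token view: one discrete id per non-overlapping patch.

    Assumption: a learned tokenizer is replaced here by quantizing the
    mean patch intensity into `codebook_size` bins.
    """
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    means = patches.mean(axis=(1, 3, 4))  # shape (h/patch, w/patch)
    ids = np.floor(means / 256.0 * codebook_size).astype(np.int64)
    return ids.reshape(-1)

def scale_sequence(image, patch_sizes=(8, 4, 2), codebook_size=16):
    """Next-scale view: a list of token maps ordered coarse to fine.

    Assumption: each scale is tokenized independently; real multi-scale
    tokenizers are typically learned and may couple the scales.
    """
    return [token_sequence(image, p, codebook_size) for p in patch_sizes]

# A synthetic 8x8 RGB image with values in [0, 255].
image = (np.arange(8 * 8 * 3) % 256).astype(np.uint8).reshape(8, 8, 3)

print(len(pixel_sequence(image)))               # 192 pixel values
print(len(token_sequence(image)))               # 4 patch tokens
print([len(s) for s in scale_sequence(image)])  # [1, 4, 16]
```

In this sketch the pixel view yields the longest sequence, the token view compresses it to one id per patch, and the scale view replaces a single flat sequence with a coarse-to-fine list of token maps, each of which would be predicted as one autoregressive step.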