Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei

TL;DR

This work tackles the single-scale context limitation of next-token autoregressive image generation by introducing Hi-MAR, a two-phase hierarchical framework that first predicts a few low-resolution tokens to capture global structure and then predicts the high-resolution tokens using those low-resolution tokens as pivots. It adds a scale-aware Transformer backbone and a Diffusion Transformer head that strengthens global context and inter-token dependencies during masked token prediction, achieving stronger results at lower computational cost. Empirically, Hi-MAR outperforms state-of-the-art diffusion and autoregressive baselines on ImageNet and MS-COCO across class-conditional and text-to-image generation tasks, while offering a favorable speed/accuracy trade-off. The results indicate that supplying global structure through low-resolution pivots benefits visual autoregressive modeling in both quality and efficiency.

Abstract

Autoregressive models have emerged as a powerful paradigm for visual generation. The de facto standard, next-token prediction, commonly operates over a single-scale sequence of dense image tokens and cannot exploit global context, especially when predicting early tokens. In this paper, we introduce a new autoregressive design that models a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and explores the hierarchical dependency across multi-scale image tokens. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few low-resolution image tokens that function as intermediary pivots reflecting global structure. These pivots then act as additional guidance for the next autoregressive modeling phase, giving the typical dense image tokens global structural awareness. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for masked token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines while requiring lower computational cost. Code is available at https://github.com/HiDream-ai/himar.
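
The two-phase procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the predictors are random stand-ins for the Transformer and its diffusion heads, and the cosine unmasking schedule, the vocabulary size of 16, the integer tokens, and all function names are assumptions chosen for illustration.

```python
import math
import random

MASK = None  # marker for a token that has not been generated yet


def masked_autoregressive_decode(seq_len, num_steps, predict_fn, rng):
    """Fill a fully masked sequence over several steps, in random order.

    Assumption: a cosine schedule decides how many tokens stay masked
    after each step; the last step reveals everything that remains.
    """
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        if step == num_steps - 1:
            keep = 0
        else:
            ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
            # Always reveal at least one token per step.
            keep = min(len(masked) - 1, math.floor(seq_len * ratio))
        rng.shuffle(masked)
        known = list(tokens)  # tokens revealed in one step are predicted in parallel
        for pos in masked[keep:]:
            tokens[pos] = predict_fn(known, pos)
    return tokens


def hi_mar_generate(low_len, high_len, low_steps, high_steps, seed=0):
    """Hypothetical two-phase generation: low-resolution pivots, then dense tokens."""
    rng = random.Random(seed)

    def predict_low(known, pos):
        # Stand-in for the first-phase model; integers replace sampled tokens.
        return rng.randrange(16)

    pivots = masked_autoregressive_decode(low_len, low_steps, predict_low, rng)

    def predict_high(known, pos):
        # Stand-in for the second-phase model: every dense token is
        # conditioned on a pivot (here via a toy position mapping).
        return (pivots[pos % low_len] + rng.randrange(4)) % 16

    dense = masked_autoregressive_decode(high_len, high_steps, predict_high, rng)
    return pivots, dense
```

For example, `hi_mar_generate(64, 256, low_steps=32, high_steps=4)` returns 64 pivot tokens and 256 dense tokens. The token counts are hypothetical; the point of the sketch is only that the second phase starts from a fully masked dense sequence but already has the complete set of pivots available as guidance.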

Paper Structure

This paper contains 24 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Next-token autoregressive (AR). GPT-style autoregressive models process 2D image tokens as a 1D sequence, predicting tokens in raster order using causal attention to ensure each token depends only on preceding ones. (b) Next-token masked autoregressive model (MAR). BERT-style autoregressive models initially consider all tokens to be masked, then predict each masked token from the known tokens in a random order, leveraging bidirectional attention to enable parallel prediction of a subset of tokens. (c) Hierarchical masked autoregressive model (Hi-MAR). Hierarchical masked autoregressive models adopt a hierarchical prediction strategy to address the lack of global context in next-token prediction. Hi-MAR first predicts a low-resolution image token sequence, which contains only a few tokens, to reflect the global structure, and then pivots on these tokens to enhance and refine the next-resolution prediction.
  • Figure 2: (a) Pipeline of conventional hierarchical MAR. Conventional hierarchical MAR uses a shared Transformer for both phases and directly leverages low-resolution visual tokens to guide second-phase predictions. (b) Pipeline of Hi-MAR. An image and its low-resolution counterpart are converted into token sequences at two scales. During inference, both sequences are initially masked. In the first phase, masked low-resolution tokens are processed by the Transformer to predict conditional tokens, followed by an MLP-based diffusion head for token reconstruction. In the second phase, the masked high-resolution tokens and the conditional tokens predicted in the first phase are fed into the Transformer, and the Diffusion Transformer head predicts the full high-resolution token sequence. (c) Scale-aware Transformer blocks consist of adaLN-Zero, layernorm, self-attention, and feed-forward layers. (d) MLP-based Diffusion head blocks include adaLN, layernorm, and feed-forward layers. (e) Diffusion Transformer head blocks are composed of adaLN, layernorm, self-attention, and feed-forward layers.
  • Figure 3: Speed/accuracy trade-off.
  • Figure 4: Impact of autoregressive steps. The experiments are conducted using the Hi-MAR-B model. For experiments varying low-resolution inference steps, typical dense inference steps are fixed at 4. Similarly, for experiments varying typical dense inference steps, low-resolution inference steps are fixed at 32.
  • Figure 5: Qualitative results on class-conditional image generation and text-to-image generation. The top rows show class-conditional generation, while the bottom rows show text-to-image generation.
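
The Figure 2 caption lists adaLN-Zero among the components of the scale-aware Transformer blocks. As a rough illustration of that mechanism in general (not of the paper's actual layers), the sketch below applies adaLN-Zero modulation to a single token vector in plain Python. The names `adaln_zero_block`, `zero_init_params`, and `cond` are hypothetical, the conditioning vector is mapped by a single linear layer for brevity, and `sublayer` stands in for self-attention or a feed-forward layer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, c):
    """Plain matrix-vector product plus bias."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


def zero_init_params(dim, cond_dim):
    """Zero-initialized modulation layer: shift, scale, and gate all start at 0."""
    return {
        "weight": [[0.0] * cond_dim for _ in range(3 * dim)],
        "bias": [0.0] * (3 * dim),
    }


def adaln_zero_block(x, cond, params, sublayer):
    """Residual block whose normalization is modulated by a conditioning vector.

    shift, scale, and gate are regressed from `cond`; the gate scales the
    residual branch, so a zero-initialized block acts as the identity.
    """
    dim = len(x)
    mod = linear(params["weight"], params["bias"], cond)
    shift, scale, gate = mod[:dim], mod[dim:2 * dim], mod[2 * dim:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]
```

With `zero_init_params`, the block returns its input unchanged for any `cond`, which is the property the "Zero" in adaLN-Zero refers to; the residual branch only contributes once training moves the gate away from zero.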