Direction-Aware Diagonal Autoregressive Image Generation

Yijia Xu, Jianzhong Ju, Jian Luan, Jinshi Cui

TL;DR

This work introduces Direction-Aware Diagonal Autoregressive (DAR) image generation, which reorders image tokens along a diagonal path to keep adjacent tokens proximal while enriching directional context. It adds 4D-RoPE and direction embeddings to handle frequent changes in generation direction and uses the image tokenizer’s codebook as frozen token embeddings, enabling a next-token autoregressive framework that remains compatible with language-model architectures. DAR scales across 485M–2.0B parameters, with the DAR-XL model achieving a state-of-the-art FID of 1.37 on 256×256 ImageNet, outperforming prior autoregressive methods. The combination of diagonal scanning, directional attention, and frozen codebook embeddings yields strong image fidelity and efficient sampling, highlighting a path toward unified multimodal foundation models.

Abstract

The raster-ordered image token sequence exhibits a significant Euclidean distance between index-adjacent tokens at line breaks, making it unsuitable for autoregressive generation. To address this issue, this paper proposes the Direction-Aware Diagonal Autoregressive Image Generation (DAR) method, which generates image tokens following a diagonal scanning order. The proposed diagonal scanning order ensures that tokens with adjacent indices remain in close proximity while enabling causal attention to gather information from a broader range of directions. Additionally, two direction-aware modules, 4D-RoPE and direction embeddings, are introduced to enhance the model's capability to handle frequent changes in generation direction. To leverage the representational capacity of the image tokenizer, we use its codebook as the image token embeddings. We propose models of varying scales, ranging from 485M to 2.0B parameters. On the 256$\times$256 ImageNet benchmark, our DAR-XL (2.0B) outperforms all previous autoregressive image generators, achieving a state-of-the-art FID score of 1.37.
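The distance argument in the abstract can be checked with a small sketch. The abstract does not spell out the exact traversal, so the code below assumes a zigzag walk over anti-diagonals (direction alternating on each diagonal), and the 16×16 grid size is likewise a hypothetical choice for illustration. The function names `raster_order`, `diagonal_order`, and `max_step` are not from the paper.

```python
import math

def raster_order(h, w):
    """Row-major (raster) order of (row, col) positions."""
    return [(r, c) for r in range(h) for c in range(w)]

def diagonal_order(h, w):
    """Assumed zigzag order: walk each anti-diagonal (r + c == d),
    reversing direction on every other diagonal."""
    order = []
    for d in range(h + w - 1):
        cells = [(r, d - r) for r in range(max(0, d - w + 1), min(h, d + 1))]
        if d % 2 == 0:
            cells.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(cells)
    return order

def max_step(order):
    """Largest Euclidean distance between index-adjacent tokens."""
    return max(math.dist(a, b) for a, b in zip(order, order[1:]))

if __name__ == "__main__":
    h = w = 16  # hypothetical token grid size
    print("raster   max step:", max_step(raster_order(h, w)))
    print("diagonal max step:", max_step(diagonal_order(h, w)))
```

Under this assumed traversal, every step moves to one of the eight neighbouring cells, so the largest gap between consecutive tokens is $\sqrt{2}$, whereas the raster order jumps almost the full grid width at each line break.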

Paper Structure

This paper contains 15 sections, 6 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Generated samples of DAR. We present samples generated by DAR, which is trained on the 256$\times$256 ImageNet dataset.
  • Figure 2: Illustration of two image token arrangements: the raster scan order and the diagonal scan order, each shown in its own subfigure.
  • Figure 3: Overview of the proposed Direction-Aware Diagonal Autoregressive (DAR) model. The discrete image tokens are arranged in the diagonal scanning order. These tokens are then processed through the codebook within the image tokenizer and an MLP to obtain image token embeddings. The class embedding is prepended to the sequence, which is subsequently fed into the autoregressive transformer. Within the transformer block, 4D-RoPE, which combines both the current and next positions, is employed during multi-head attention. AdaLN calculates the scale and shift parameters using the sum of class embeddings and direction embeddings.
  • Figure 4: Scaling up behavior of DAR models. We show the training loss curve for models of varying scales, alongside the FID score curves with and without classifier-free guidance. As the model size scales, the loss subfigure demonstrates a consistent reduction in loss, while the two FID subfigures illustrate a consistent decrease in FID score without and with classifier-free guidance, respectively.
  • Figure 5: Visualization of sample images generated by DAR of varying scales. As the model size increases, a noticeable improvement in image quality is observed.
  • ...and 8 more figures
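
The Figure 3 caption says that 4D-RoPE combines the current and next positions and that a direction embedding feeds AdaLN. The following is a minimal sketch of the index bookkeeping this implies, not the paper's implementation: it assumes a zigzag diagonal order, pairs each token's (row, col) with that of the next token to form a 4-tuple, and maps each step offset to a hypothetical direction id. The names `DIRECTIONS` and `position_and_direction_ids` and the id assignment are illustrative assumptions; the rotary encoding and the embedding lookup themselves are omitted.

```python
def diagonal_order(h, w):
    """Assumed zigzag order over anti-diagonals (r + c == d)."""
    order = []
    for d in range(h + w - 1):
        cells = [(r, d - r) for r in range(max(0, d - w + 1), min(h, d + 1))]
        if d % 2 == 0:
            cells.reverse()
        order.extend(cells)
    return order

# Hypothetical id per step offset (d_row, d_col); the zigzag walk only
# ever moves right, down, down-left, or up-right.
DIRECTIONS = {(0, 1): 0, (1, 0): 1, (1, -1): 2, (-1, 1): 3}

def position_and_direction_ids(order):
    """For each token except the last, return the 4-tuple
    (row, col, next_row, next_col) and the id of the step direction."""
    pos4d, dirs = [], []
    for (r, c), (nr, nc) in zip(order, order[1:]):
        pos4d.append((r, c, nr, nc))
        dirs.append(DIRECTIONS[(nr - r, nc - c)])
    return pos4d, dirs

if __name__ == "__main__":
    pos4d, dirs = position_and_direction_ids(diagonal_order(3, 3))
    for p, d in zip(pos4d, dirs):
        print(p, "direction id", d)
```

In a model, the 4-tuples would index the rotary position encoding and the direction ids would index a small embedding table; how the class token at the start of the sequence is handled is left out of this sketch.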