Distilling semantically aware orders for autoregressive image generation

Rishav Pramanik, Antoine Poupon, Juan A. Rodriguez, Masih Aminbeidokhti, David Vazquez, Christopher Pal, Zhaozheng Yin, Marco Pedersoli

TL;DR

This paper tackles the problem that fixed raster-scan generation in autoregressive image models imposes an artificial sequence that may harm quality. It introduces Ordered Autoregressive (OAR) image generation, which first trains an any-order AR model, then distills the semantically meaningful order that this model infers and fine-tunes the model to follow it, effectively turning the ordering problem into a self-supervised refinement. The method combines a dual positional encoding strategy (absolute current position and relative next position), distance-aware generation, and a distillation step to produce higher-quality images on the Fashion Product and CelebA-HQ datasets, with competitive training costs. The practical impact lies in improving image realism and coherence in autoregressive, patch-based generation while maintaining compatibility with vision-language models and without requiring extra annotations. Overall, OAR demonstrates that content-aware generation orders can improve autoregressive image synthesis over raster-scan ordering and provides a scalable path to deploying order-aware decoders in multimodal systems.

Abstract

Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While text has a natural order, the one in which the words are written or spoken, images have no inherent generation order. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of the clouds should depend on the color of the sun and not the inverse. First, we show that by training a model to generate patches in any given order, we can infer both the content and the location (order) of each patch during generation. Second, we use these extracted orders to fine-tune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.
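
The idea of inferring both the content and the location of each patch can be illustrated with a minimal sketch. It assumes a hypothetical function `logits_fn` that stands in for an any-given-order model queried at every position, and it picks the next position by greedy maximum token probability. Both the function interface and the greedy selection rule are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ordered_decode(logits_fn, num_patches):
    """Greedy order-aware decoding (illustrative sketch).

    logits_fn(tokens, filled) -> array of shape (num_patches, vocab_size).
    It is a hypothetical stand-in for an any-given-order model that
    scores every candidate position given the patches generated so far.
    """
    tokens = np.full(num_patches, -1, dtype=int)   # -1 marks "not generated"
    filled = np.zeros(num_patches, dtype=bool)
    order = []
    for _ in range(num_patches):
        probs = softmax(logits_fn(tokens, filled))
        best_token = probs.argmax(axis=-1)         # most likely token per position
        confidence = probs.max(axis=-1)            # its probability
        confidence[filled] = -np.inf               # never revisit a position
        pos = int(confidence.argmax())             # most confident open position
        tokens[pos] = best_token[pos]
        filled[pos] = True
        order.append(pos)
    return tokens, order


def toy_logits_fn(tokens, filled, vocab_size=8, seed=0):
    """Random logits; a placeholder for a trained model (assumption)."""
    rng = np.random.default_rng(seed + int(filled.sum()))
    return rng.normal(size=(tokens.shape[0], vocab_size))


tokens, order = ordered_decode(toy_logits_fn, num_patches=16)
```

Here `order` records the sequence of patch positions, which is the kind of signal that could then be reused as a fixed target order during fine-tuning.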

Paper Structure

This paper contains 35 sections, 10 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: Generation with our distilled order on the Fashion Product dataset (Left) and the Multimodal CelebA-HQ dataset (Right) with the corresponding generation order produced by our Ordered Autoregressive (OAR) model. The generation order is visualized through color intensity, progressing from yellow (early patches) to violet (later patches). Our learned order typically starts with simpler regions of the image before moving to more complex ones. For the Fashion Product dataset, this often means generating the white background first, while in the CelebA-HQ dataset, the model tends to begin with facial regions like the cheeks and chin, which are generally easier to generate.
  • Figure 2: Different Autoregressive (AR) models. (Top) Raster scan is the standard approach for autoregressive generation, from top-left to bottom-right. The input token contains the content $x_i$ and the position $l_i$. (Middle) Any-given-order learns to generate tokens at any possible location. However, the position of the next token must be given as input through an additional positional embedding. (Bottom) Our method, Ordered Autoregressive, uses the any-given-order model but generates candidates for all possible positions and selects the most likely one (darker yellow) as the next generated token.
  • Figure 3: Examples of generation on the Fashion Products dataset. (Top) Generated images with the raster AR model. (Middle) Generated images with the ordered AR model. (Bottom) Generation order, from yellow to violet. From these images, we see that our approach finds an order highly correlated with the image content, often resulting in better image quality.
  • Figure 4: Examples of generation on the CelebA dataset. (Top) Generated images with the raster AR model. (Middle) Generated images with the ordered AR model. (Bottom) Generation order, from yellow to violet. On this dataset, our model first generates the salient parts of a face, leaving hair and background for the end. Our model produces images that are smoother, richer in context, and better aligned with the text.
  • Figure 5: Generation order with absolute and relative positional encoding. (Top) With absolute encoding, the generation is very scattered. (Bottom) With relative encoding, the generation is more localized. The average Euclidean distance between successively generated patches is 5.78 with absolute encoding and 4.34 with relative encoding.
  • ...and 2 more figures
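
The locality statistic quoted in the Figure 5 caption, the average Euclidean distance between successively generated patches, can be sketched as follows. The sketch assumes that a generation order is a list of flat patch indices on a row-major grid and that distance is measured in patch units. The function name `mean_step_distance` and these conventions are assumptions for illustration, since the page does not state how the authors computed the figure.

```python
import numpy as np


def mean_step_distance(order, grid_width):
    """Average Euclidean distance between consecutive patches in `order`.

    `order` holds flat patch indices on a row-major grid that is
    `grid_width` patches wide (an assumed convention).
    """
    order = np.asarray(order)
    rows, cols = np.divmod(order, grid_width)          # flat index -> (row, col)
    coords = np.stack([rows, cols], axis=1).astype(float)
    steps = np.linalg.norm(np.diff(coords, axis=0), axis=1)
    return float(steps.mean())


# Two toy orders on a 4-patch-wide grid.
raster = list(range(16))                 # row by row, jumping back at each row end
snake = [0, 1, 2, 3, 7, 6, 5, 4]         # reverses direction on every row

raster_locality = mean_step_distance(raster, grid_width=4)
snake_locality = mean_step_distance(snake, grid_width=4)
```

A lower value means that consecutive patches sit closer together, which is the sense in which the caption calls the relative-encoding order more localized.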