
Progressive Text-to-Image Generation

Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang

TL;DR

The paper tackles inefficiencies and non-hierarchical token importance in vector-quantized autoregressive text-to-image generation by introducing a progressive, coarse-to-fine generation framework in the latent VQ-GAN space. It defines multi-stage image token prediction with two scoring strategies (quantization error and dynamic importance) and an image token revision mechanism to mitigate early-stage errors. Empirical results on MS COCO show improved FID and image-text alignment, along with substantial inference speedups (over 13x) compared with traditional autoregressive models. The approach offers an interpretable generation process with strong potential as a building block for scalable, high-fidelity text-to-image synthesis.

Abstract

Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens in the latent space from top left to bottom right, treating every token equally. Although this simple generative process works surprisingly well, is it the best way to generate an image? For instance, human creation tends to proceed from outline to fine detail, while VQ-AR models do not consider the relative importance of image patches. In this paper, we present a progressive model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine, in parallel, based on the existing context, and this procedure is applied recursively with the proposed error revision mechanism until the image sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments on the MS COCO benchmark demonstrate that the progressive model produces significantly better results than the previous VQ-AR method in FID score across a wide variety of categories and aspects. Moreover, the design of parallel generation in each step allows more than $13\times$ inference acceleration with slight performance loss.
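
The coarse-to-fine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `progressive_decode`, `toy_predict`, and `revise_threshold` are hypothetical, the confidence score is a stand-in for the paper's token scoring strategies, and "revision" is approximated here by re-masking low-confidence tokens so that a later stage can regenerate them.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has no token yet


def progressive_decode(
    seq_len: int,
    num_stages: int,
    predict: Callable[[List[Optional[int]]], List[Tuple[int, float]]],
    revise_threshold: float = 0.0,
) -> List[int]:
    """Toy coarse-to-fine decoder (illustrative assumption, not the paper's code).

    `predict` is a hypothetical model call that returns one
    (token, confidence) pair per position given the current context.
    """
    tokens: List[Optional[int]] = [MASK] * seq_len
    per_stage = -(-seq_len // num_stages)  # ceiling division

    for stage in range(num_stages):
        proposals = predict(tokens)  # one parallel prediction per stage

        # Assumed revision step: re-mask placed tokens whose confidence
        # falls below the threshold, so they can be generated again.
        for pos, (_, conf) in enumerate(proposals):
            if tokens[pos] is not MASK and conf < revise_threshold:
                tokens[pos] = MASK

        empty = [p for p in range(seq_len) if tokens[p] is MASK]
        # The final stage fills every remaining position.
        budget = len(empty) if stage == num_stages - 1 else per_stage
        # Most confident positions first: a stand-in for "coarse before fine".
        chosen = sorted(empty, key=lambda p: proposals[p][1], reverse=True)[:budget]
        for p in chosen:
            tokens[p] = proposals[p][0]

    return tokens


def toy_predict(tokens: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Deterministic dummy model: token = position mod 7, confidence decays with position."""
    return [(pos % 7, 1.0 / (1 + pos)) for pos in range(len(tokens))]


result = progressive_decode(seq_len=8, num_stages=3, predict=toy_predict)
print(result)
```

In this sketch the model is called `num_stages` times rather than once per token, which is the kind of saving that parallel generation within each step makes possible; the actual speedup reported in the paper depends on its own model and settings.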

Paper Structure (30 sections, 5 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of different generation orders for text-to-image synthesis. The conventional model generates the vector-quantized image sequence from left to right (top), while our progressive model creates image patches from coarse to fine (bottom).
  • Figure 2: Selected image samples generated from the progressive model and corresponding text prompts. Refer to Sec. 4.3 for a more detailed discussion.
  • Figure 3: Overview of the proposed progressive text-to-image model, with left-to-right, random, and coarse-to-fine generation orders in the VQ-GAN latent space. Red symbols denote the error revision process.
  • Figure 4: Effects of image token revision.
  • Figure 5: Images generated by the progressive model that show errors in number counting and in understanding negative semantics, which motivate future improvements.
  • ...and 1 more figures