Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding

Jiacheng Li, Longhui Wei, ZongYuan Zhan, Xin He, Siliang Tang, Qi Tian, Yueting Zhuang

TL;DR

Lformer introduces a semi-autoregressive approach to text-to-image generation that decodes discrete 2D image tokens in mirrored L-shape blocks, enabling parallel token generation while preserving autoregressive-like context. By combining this L-shape decoding with a CVAE-based global representation and CLIP-informed conditioning, the model achieves substantial speedups over traditional AR transformers while maintaining competitive image fidelity and diversity. The approach supports image editing without fine-tuning by rolling back to earlier steps or region-based inpainting using bounding boxes and prompts. Empirically, Lformer attains fast inference (especially with attention caching) and strong results on MMCelebA-HQ and MS-COCO, illustrating its practical potential for scalable, editable T2I generation.

Abstract

Generative transformers excel at synthesizing high-fidelity, high-resolution images, offering good diversity and training stability. However, they generate slowly because they must produce a long token sequence autoregressively. To accelerate generative transformers while preserving generation quality, we propose Lformer, a semi-autoregressive text-to-image generation model. Lformer first encodes an image into $h{\times}h$ discrete tokens, then divides these tokens into $h$ mirrored L-shape blocks from the top left to the bottom right and decodes all tokens in a block in parallel at each step. Like autoregressive models, Lformer predicts the area adjacent to the previous context, so it remains stable while accelerating generation. By leveraging the 2D structure of image tokens, Lformer is faster than existing transformer-based methods while keeping good generation quality. Moreover, the pretrained Lformer can edit images without finetuning: we can roll back to an earlier step for regeneration, or edit the image with a bounding box and a text prompt.
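To make the block partition concrete, here is a minimal sketch of how an $h{\times}h$ token grid could be split into $h$ mirrored L-shape blocks. It assumes that the block index of a token at `(row, col)` is `max(row, col)`, which is consistent with blocks growing from the top left to the bottom right; the function name `l_shape_blocks` and the ordering of tokens inside a block are illustrative choices, not taken from the paper.

```python
def l_shape_blocks(h):
    """Group the positions of an h x h grid into h mirrored L-shape blocks.

    Assumption: a token at (row, col) belongs to block max(row, col), so
    block k covers column k (rows 0..k) and row k (cols 0..k).
    """
    blocks = [[] for _ in range(h)]
    for row in range(h):
        for col in range(h):
            blocks[max(row, col)].append((row, col))
    return blocks


blocks = l_shape_blocks(8)
sizes = [len(b) for b in blocks]   # block k holds 2k + 1 tokens
total_tokens = sum(sizes)          # every grid position is covered once
decoding_steps = len(blocks)       # one parallel step per block
```

Under this assumption an $8{\times}8$ grid is decoded in 8 parallel steps rather than 64 token-by-token steps, and the tokens generated before block $k$ form a $k{\times}k$ square.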

Paper Structure (17 sections, 9 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Selected 256$\times$256 samples generated by our model. Lformer can generate both photorealistic and artistic content.
  • Figure 2: Generation process of Lformer (taking $h{\times}w=8{\times}8$ as an example). We divide an image token map into $h$ mirrored L-shape blocks from the top left to the bottom right. At each step, we generate all the tokens in an L-shape block in parallel.
  • Figure 3: L-shape Parallel Decoding. (a) We rearrange the tokens in L-order and generate all the tokens in an L-shape block (marked in colors) in parallel in one step. $\mathrm{C}$ denotes the condition, and $\mathrm{P}$ denotes a pad token. (b) Causal attention mask for the transformer implementation of Lformer. The tokens in a block can attend to the context of previously generated tokens, which forms a square. (c) We add pad tokens at both sides of the previous L-block to align it with the next L-block for the transformer implementation.
  • Figure 4: Model structure of Lformer. We adopt a CVAE to introduce a latent variable $w$ as a global representation that addresses the inconsistency of parallel generation. We concatenate $w$ to the global text feature to form the final condition. The part in the dashed box is removed at inference time.
  • Figure 5: Random 256$\times$256 samples generated by Lformer-L with text prompts from the MMCelebA-HQ validation set.
  • ...and 3 more figures
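
The block-wise causal mask described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes block membership is `max(row, col)`, picks an arbitrary token order inside each block, and leaves out the condition token $\mathrm{C}$ and the pad tokens $\mathrm{P}$. The names `l_order` and `block_causal_mask` are hypothetical.

```python
def l_order(h):
    """Flatten an h x h grid block by block (L-order).

    Assumption: block index is max(row, col); the order inside a block
    is an arbitrary but fixed choice made for this sketch.
    """
    positions = [(r, c) for r in range(h) for c in range(h)]
    return sorted(positions, key=lambda p: (max(p), p))


def block_causal_mask(h):
    """mask[q][k] is True when the token predicted at q may attend to token k.

    A token in block b sees only tokens of strictly earlier blocks,
    which together form a b x b square of previously generated context.
    """
    order = l_order(h)
    block = [max(p) for p in order]
    n = len(order)
    return [[block[k] < block[q] for k in range(n)] for q in range(n)]


mask = block_causal_mask(3)
visible_per_token = [sum(row) for row in mask]
```

For a $3{\times}3$ grid, the single token of the first block sees no image tokens (only the condition in the full model), each token of the second block sees 1 token, and each token of the third block sees the 4 tokens of the $2{\times}2$ square.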