PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin

TL;DR

PlanGen delivers a unified autoregressive vision-language framework that jointly handles layout planning and layout-to-image generation, avoiding the need for separate layout planners or embed-and-pool encodings. By employing unified prompting with dedicated layout and image tokens, plus tasks like image layout understanding and layout-guided manipulation, it achieves strong performance across layout planning, image synthesis, and editing tasks. Empirical results on diverse datasets show PlanGen surpassing diffusion-based baselines in layout adherence and image quality, while enabling accurate image layout understanding and effective object deletion with negative layout guidance. The work demonstrates the practicality of a single multitask model for complex spatial image generation and manipulation, with potential for higher-resolution extensions and alternative autoregressive strategies.

Abstract

In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based methods that treat layout planning and layout-to-image generation as two separate models, PlanGen jointly models the two tasks in one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitask training on layout-related tasks, including layout planning, layout-to-image generation, and image layout understanding. In addition, PlanGen can be seamlessly extended to layout-guided image manipulation thanks to its modeling design, using a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.
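
To make the idea of passing layout conditions "as context" more concrete, the following is a minimal sketch of how a global caption, local captions, and bounding boxes might be flattened into one text prompt for an autoregressive model. The tag names (`<caption>`, `<layout>`, `<obj>`, `<box>`), the `Box` class, the `quantize` and `serialize_layout` helpers, and the 1000-bin coordinate quantization are all assumptions made for illustration; the abstract does not specify PlanGen's actual prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """One layout element: a local caption plus a normalized bounding box."""
    caption: str
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float, bins: int = 1000) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * bins), bins - 1)


def serialize_layout(global_caption: str, boxes: List[Box], bins: int = 1000) -> str:
    """Flatten a caption and its layout into a single plain-text prompt.

    The tag vocabulary here is hypothetical; it only illustrates that layout
    conditions can be ordinary tokens in the context rather than the output
    of a dedicated layout encoder.
    """
    parts = [f"<caption>{global_caption}</caption>", "<layout>"]
    for box in boxes:
        coords = ",".join(
            str(quantize(c, bins)) for c in (box.x1, box.y1, box.x2, box.y2)
        )
        parts.append(f"<obj>{box.caption}<box>{coords}</box></obj>")
    parts.append("</layout>")
    return "".join(parts)


if __name__ == "__main__":
    prompt = serialize_layout(
        "a dog next to a red ball",
        [Box("a dog", 0.0, 0.25, 0.5, 1.0), Box("a red ball", 0.5, 0.75, 0.75, 1.0)],
    )
    print(prompt)
```

Under this sketch, layout planning would correspond to the model generating the `<layout>` span from the caption, and layout-to-image generation would correspond to supplying that span as context before the image tokens. Both then reduce to next-token prediction over one sequence.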

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: PlanGen models layout planning and image generation jointly, allowing layout planning before generating corresponding images, and the two processes are completed in a unified model. PlanGen can perform multi-type tasks related to layout, including a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding and d) layout-guided image manipulation.
  • Figure 2: Different paradigms of image generation. a) The naive text-to-image cannot control the layout of the generated images. b) Previous methods use two independent models to complete layout planning and image generation. c) We adopt a unified model to complete layout planning and image generation.
  • Figure 3: Upper: PlanGen models layout planning and image generation jointly in an autoregressive vision-language model through a unified prompting design with the next-token prediction training objective. Lower: Illustration of PlanGen’s layout-related multitasking: a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding and d) layout-guided image manipulation.
  • Figure 4: Prompt design of PlanGen. a) Prompt for Layout-Image Joint Generation. b) Prompt Example for Layout Condition. c) Prompt for Image Layout Understanding.
  • Figure 5: Examples of layout planning. Compared with Qwen-2.5-7b-instruct and Llama-3.1-8b-instruct, PlanGen generates more reasonable layout conditions from complex global captions. We use PlanGen to perform layout-to-image generation on the layout conditions produced by the three methods to further assess their quality.
  • ...and 7 more figures