LSReGen: Large-Scale Regional Generator via Backward Guidance Framework

Bowen Zhang, Cheng Yang, Xuanhui Liu

TL;DR

This work addresses controllable image generation for large-scale layouts by proposing a universal backward guidance framework that does not rely on cross-attention signals. Building on a pre-trained, low-parameter layout-to-image model (GLIGEN), LSReGen extracts low-frequency layout features from small-scale outputs and guides early diffusion sampling to produce high-quality, layout-consistent images at large resolutions. The method achieves superior performance on large-scale layout-to-image tasks compared with state-of-the-art baselines, while avoiding model training or fine-tuning. The results demonstrate the potential of training-free, geometry-guided diffusion control and offer open-source resources for broader applications.

Abstract

In recent years, advancements in AIGC (Artificial Intelligence Generated Content) technology have significantly enhanced the capabilities of large text-to-image models. Despite these improvements, controllable image generation remains a challenge. Current methods, such as training, forward guidance, and backward guidance, have notable limitations. The first two approaches either demand substantial computational resources or produce subpar results. The third approach depends on phenomena specific to certain model architectures, complicating its application to large-scale image generation. To address these issues, we propose a novel controllable generation framework that offers a generalized interpretation of backward guidance without relying on specific assumptions. Leveraging this framework, we introduce LSReGen, a large-scale layout-to-image method designed to generate high-quality, layout-compliant images. Experimental results show that LSReGen outperforms existing methods in the large-scale layout-to-image task, underscoring the effectiveness of our proposed framework. Our code and models will be open-sourced.

Paper Structure (16 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Our approach, LSReGen, takes bounding boxes as input and utilizes the low-parameter pre-trained layout-to-image model GLIGEN as a preprocessor to extract low-frequency information from images. The information serves as layout features to guide the sampling process, resulting in the generation of larger-scale images with richer elements and higher quality.
  • Figure 2: Backward guidance framework. Implementing backward guidance during the image sampling process enables controllable image generation. We propose a general interpretation that encompasses backward guidance. The feature extractor extracts approximate features of the ideal image. During sampling, the output is steered toward the target by minimizing the difference between the features of the intermediate variables and those of the ideal image.
  • Figure 3: The cross-attention maps of different Stable Diffusion versions. In earlier Stable Diffusion models, the attention map of an individual object word shows only that object, while the other objects remain hidden in the background. For example, in the attention maps for the word "burger," the squirrel cannot be seen. In contrast, in SDXL, whose architecture differs from that of the earlier models, both objects can be clearly seen.
  • Figure 4: Overview of LSReGen. The left side illustrates the general flow of our approach. The right side shows two different images generated with our approach; both layouts remain consistent with the provided positional information.
  • Figure 5: Comparison with other methods at different numbers of sampling steps. Our method, GLIGEN [li2023gligen], and layout-guidance [chen2024training] are hardly influenced by the number of sampling steps, while BoxDiff [xie2023boxdiff] needs more sampling steps to keep object positions accurate.
  • ...and 3 more figures
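
The backward guidance idea described in the Figure 2 caption (extract features, then nudge the intermediate variable so its features approach those of the ideal image) can be illustrated with a toy sketch. This is not the authors' implementation: the average-pooling extractor `low_pass`, the squared-error loss, the step size, and all function names are assumptions chosen for illustration, and the diffusion denoiser that would normally run between guidance steps is omitted.

```python
import numpy as np

def low_pass(x: np.ndarray, k: int = 4) -> np.ndarray:
    """Hypothetical feature extractor: k x k average pooling keeps coarse, low-frequency content."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def feature_loss(x: np.ndarray, target_feat: np.ndarray, k: int = 4) -> float:
    """0.5 * squared distance between the features of x and the target features."""
    diff = low_pass(x, k) - target_feat
    return 0.5 * float(np.sum(diff ** 2))

def guidance_grad(x: np.ndarray, target_feat: np.ndarray, k: int = 4) -> np.ndarray:
    """Analytic gradient of feature_loss with respect to x."""
    diff = low_pass(x, k) - target_feat
    # Each pixel contributes 1/k^2 to its block mean, so the block error is
    # spread back evenly over the k x k pixels of that block.
    return np.kron(diff, np.ones((k, k))) / (k * k)

def backward_guidance_step(x, target_feat, step_size=8.0, k=4):
    """One guidance update: move x down the gradient of the feature loss."""
    return x - step_size * guidance_grad(x, target_feat, k)

# Toy usage: a random "intermediate variable" is pulled toward the
# low-frequency features of a random "reference" image (assumed data).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
target_feat = low_pass(rng.standard_normal((16, 16)))

loss_before = feature_loss(x, target_feat)
for _ in range(5):  # guidance applied for a few (early) steps only
    x = backward_guidance_step(x, target_feat)
loss_after = feature_loss(x, target_feat)
print(loss_before, loss_after)
```

Because the extractor here constrains only block averages, the update changes the coarse layout of `x` while leaving the detail within each block untouched, which loosely mirrors the idea of guiding with low-frequency layout features and letting the sampler fill in the rest. In a real pipeline the gradient would typically come from automatic differentiation through the chosen feature extractor rather than a hand-derived formula.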