SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Shang Chai, Zihang Lin, Min Zhou, Xubin Li, Liansheng Zhuang, Houqiang Li

TL;DR

SceneBooth tackles subject-preserved text-to-image generation by fixing the foreground subject and generating a harmonious background guided by captions and object phrases. It introduces MCLayoutDM for multimodal layout generation and PaintNet for background painting using adapters (ControlNet and Gated Self-Attention) and instance-masked training. Experiments on COCO show significant improvements in subject fidelity, layout harmony, and image quality over inpainting baselines, corroborated by quantitative metrics and human judgments. The approach enables high-fidelity, semantically coherent scene synthesis with controllable subject placement, and demonstrates extensions like subject dragging and open-world generation, while acknowledging limitations in occlusion handling and crowded layouts.

Abstract

Driven by the demand for personalized image generation, subject-driven text-to-image generation, which creates novel renditions of an input subject based on text prompts, has received growing research interest. Existing methods often learn a subject representation and incorporate it into the prompt embedding to guide image generation, but they struggle to preserve subject fidelity. To solve this issue, this paper proposes a novel framework named SceneBooth for subject-preserved text-to-image generation, which takes a subject image, object phrases, and text prompts as input. Instead of learning a subject representation and generating the subject, SceneBooth fixes the given subject image and generates its background guided by the text prompts. To this end, SceneBooth introduces two key components: a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating scene layouts that align with the text captions, object phrases, and subject visual information. The latter integrates two adapters (ControlNet and Gated Self-Attention) into the latent diffusion model to generate a background that harmonizes with the subject, guided by scene layouts and text descriptions. In this manner, SceneBooth ensures accurate preservation of the subject's appearance in the output. Quantitative and qualitative experimental results demonstrate that SceneBooth significantly outperforms baseline methods in terms of subject preservation, image harmonization, and overall quality.
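
The idea of fixing the subject rather than regenerating it can be illustrated with a small sketch. It is not the authors' implementation: the function names, the normalized (x0, y0, x1, y1) box format, the RGBA cut-out input, and the final paste-back step are all assumptions made for illustration. A layout module would supply the box, and a painting module would supply the background. The point here is only that the subject pixels enter the output unchanged.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array (stand-in for a real resizer)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols[None, :]]

def place_subject(subject_rgba, box, canvas_hw):
    """Place an RGBA subject cut-out inside a normalised (x0, y0, x1, y1) box.

    Hypothetical helper: returns an RGB canvas holding the subject and a
    boolean mask marking which canvas pixels belong to the subject.
    """
    H, W = canvas_hw
    x0, y0, x1, y1 = box
    left, top = int(round(x0 * W)), int(round(y0 * H))
    right, bottom = int(round(x1 * W)), int(round(y1 * H))
    patch = resize_nearest(subject_rgba, bottom - top, right - left)
    canvas = np.zeros((H, W, 3), dtype=subject_rgba.dtype)
    mask = np.zeros((H, W), dtype=bool)
    alpha = patch[..., 3] > 0  # opaque pixels of the cut-out
    canvas[top:bottom, left:right][alpha] = patch[..., :3][alpha]
    mask[top:bottom, left:right] = alpha
    return canvas, mask

def composite(background, canvas, mask):
    """Keep subject pixels exactly as given; take everything else from the background."""
    return np.where(mask[..., None], canvas, background)
```

In this sketch the mask would also serve as a conditioning signal for the background generator, telling it which region is already occupied. How SceneBooth actually conditions on the subject is described in the method sections of the paper, not here.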

Paper Structure (17 sections, 4 equations, 8 figures, 5 tables)

This paper contains 17 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Results generated by subject-driven and subject-preserved text-to-image methods. The text prompt is "A perfume is placed in the snow." (a) Subject images. (b) Results generated by the subject-driven method DreamBooth (Ruiz et al., 2022). There is noticeable distortion in the color and appearance of the text "PERFUME PARIS". (c) Results generated by our subject-preserved method SceneBooth. The appearance of the perfume is well preserved.
  • Figure 2: Overview of our proposed SceneBooth. It consists of a layout generation module, MCLayoutDM, and a background painting module, PaintNet. We use the "*" symbol to mark the subject to preserve.
  • Figure 3: (a) Architecture of the layout denoiser in MCLayoutDM. Fourier, SA, CA, and FFN denote the Fourier embedding layer, self-attention layer, cross-attention layer, and feed-forward network, respectively. We use the "*" symbol to mark the feature embeddings representing the subject. For simplicity, we omit the layer normalization and skip connections in the Transformer blocks, as well as the diffusion timestep input $t$. (b) Architecture of PaintNet. LN and GSA denote layer normalization and Gated Self-Attention, respectively. (Best viewed in color.)
  • Figure 4: Qualitative comparison with existing methods on the COCO dataset. The subject in the object phrases is highlighted in red.
  • Figure 5: Ablation study on the effectiveness of ControlNet. We qualitatively compare PaintNet with GLIGEN-repaint and GLIGEN-inpaint on the test dataset. Ground-truth layouts are used as input.
  • ...and 3 more figures
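
The Figure 3 caption names a Fourier embedding layer and Gated Self-Attention (GSA) without giving their equations in this excerpt. The sketch below shows one common formulation of both, following the gated design described for GLIGEN: layout tokens are appended to the visual tokens, self-attention runs over the joint sequence, and the result is added back to the visual tokens through a tanh gate. Whether MCLayoutDM and PaintNet use exactly this form is an assumption. The single-head attention, the frequency schedule, and every name and shape below are hypothetical.

```python
import numpy as np

def fourier_embed(boxes, num_freqs=4):
    """Map (N, 4) box coordinates in [0, 1] to (N, 4 * 2 * num_freqs) sin/cos features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # assumed frequency schedule
    angles = boxes[..., None] * freqs              # (N, 4, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(boxes.shape[0], -1)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def gated_self_attention(visual, grounding, wq, wk, wv, gamma):
    """Attend over [visual; grounding] tokens, then add the visual part back via a tanh gate.

    With gamma = 0 the layer is an identity on the visual tokens, so a
    pretrained backbone is left undisturbed when the adapter is first attached.
    """
    tokens = np.concatenate([visual, grounding], axis=0)
    attended = self_attention(tokens, wq, wk, wv)
    return visual + np.tanh(gamma) * attended[: visual.shape[0]]
```

A usage note under the same assumptions: the Fourier features would first be projected to the visual token width (for example by a learned linear layer) before being passed in as `grounding`, and `gamma` would be a learned scalar that starts at zero.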