Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis

Zixuan Wang, Duo Peng, Feng Chen, Yuwei Yang, Yinjie Lei

TL;DR

A novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units, which enhances the model’s adaptability to diverse conditional generation tasks and greatly expands its application range.

Abstract

Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.
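
As a rough illustration of what "flexibly inserting and combining" alignment modules could look like in a training-free guidance loop, the sketch below treats each condition unit as a module that supplies the gradient of some alignment energy, and sums the weighted gradients into a single update on the latent. This is not the authors' implementation: `AlignmentModule`, `guided_step`, `quadratic_grad`, the weights, and the quadratic toy energies are all hypothetical stand-ins, since the abstract does not specify the actual DCA, DGA, or DMA objectives.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class AlignmentModule:
    """Hypothetical wrapper for one condition unit (text, layout, or drag)."""
    name: str
    weight: float
    # Returns the gradient of this module's (toy) alignment energy w.r.t. the latent.
    grad: Callable[[np.ndarray], np.ndarray]


def quadratic_grad(target: np.ndarray) -> Callable[[np.ndarray], np.ndarray]:
    """Gradient of the toy energy 0.5 * ||x - target||^2 (an assumption, not a paper loss)."""
    return lambda x: x - target


def guided_step(latent: np.ndarray,
                modules: List[AlignmentModule],
                step_size: float = 0.1) -> np.ndarray:
    """Apply one guidance update using whichever modules are plugged in."""
    total = np.zeros_like(latent)
    for module in modules:
        total += module.weight * module.grad(latent)
    # Move the latent downhill on the combined energy.
    return latent - step_size * total


latent = np.zeros(4)
text_only = [AlignmentModule("text", 1.0, quadratic_grad(np.ones(4)))]
text_and_layout = text_only + [
    AlignmentModule("layout", 0.5, quadratic_grad(-np.ones(4)))
]

out_text = guided_step(latent, text_only)
out_both = guided_step(latent, text_and_layout)
```

In this toy setup, adding or removing a condition only changes the contents of the module list; the update rule itself stays the same, which is the modularity property the abstract emphasizes.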

Paper Structure

This paper contains 15 sections, 18 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of our modular conditional image synthesis. Driven by the modular combination of fundamental condition units, diverse visual content can be synthesized.
  • Figure 2: The overall pipeline of our Dense Concept Alignment (DCA) module. Beyond scene-level vision-language consistency, this module guides the synthesized image toward dense alignment with the attribute and relation concepts in the textual condition.
  • Figure 3: The overall pipeline of our Dense Geometry Alignment (DGA) module. Based on detection information from the approximated image, this module densely aligns geometric features between the predicted and reference layouts, focusing on object-level location, size, and distance.
  • Figure 4: The overall pipeline of our Dense Motion Alignment (DMA) module. Using our densified reference drag flow, this module guides the synthesized image toward dense alignment with the motion information in the drag condition, covering pixel displacement, appearance, and semantics.
  • Figure 5: Visual comparison between our approach (Stable Diffusion v3.0 [esser2024scaling] + DCA) and its baseline under varying descriptions. Our approach aligns the semantic concepts in the descriptions more faithfully than the baseline.
  • ...and 5 more figures