Scribble-Guided Diffusion for Training-free Text-to-Image Generation

Seonho Lee, Jiho Choi, Seohyun Lim, Jiwook Kim, Hyunjung Shim

TL;DR

Scribble-Guided Diffusion (ScribbleDiff) is a training-free approach that uses simple user-provided scribbles as visual prompts to guide image generation. It introduces moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs.

Abstract

Recent advancements in text-to-image diffusion models have demonstrated remarkable success, yet they often struggle to fully capture the user's intent. Existing approaches using textual inputs combined with bounding boxes or region masks fall short in providing precise spatial guidance, often leading to misaligned or unintended object orientation. To address these limitations, we propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation. However, incorporating scribbles into diffusion models presents challenges due to their sparse and thin nature, making it difficult to ensure accurate orientation alignment. To overcome these challenges, we introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs. Experimental results on the PASCAL-Scribble dataset demonstrate significant improvements in spatial control and consistency, showcasing the effectiveness of scribble-based guidance in diffusion models. Our code is available at https://github.com/kaist-cvml-lab/scribble-diffusion.
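To make the idea of orientation alignment concrete, the following is a minimal sketch of how a centroid and principal axis can be read off a sparse scribble (or an attention map) using standard second-order image moments. It is an illustration under stated assumptions, not the authors' implementation: the function names `principal_orientation` and `moment_gap`, and the choice to compare maps by centroid distance and axis angle, are hypothetical.

```python
import numpy as np


def principal_orientation(weights):
    """Centroid and principal-axis angle of a non-negative 2D map.

    Uses standard central image moments. The angle is in radians,
    measured from the x-axis, and is only defined modulo pi
    (an axis, not a signed direction).
    """
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        raise ValueError("map is empty")
    ys, xs = np.mgrid[0:weights.shape[0], 0:weights.shape[1]]
    # First-order moments give the centroid.
    cx = (xs * weights).sum() / total
    cy = (ys * weights).sum() / total
    # Second-order central moments give the spread along and across axes.
    mu20 = (((xs - cx) ** 2) * weights).sum() / total
    mu02 = (((ys - cy) ** 2) * weights).sum() / total
    mu11 = (((xs - cx) * (ys - cy)) * weights).sum() / total
    angle = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return (cx, cy), angle


def moment_gap(scribble, attention):
    """Hypothetical alignment measure: centroid distance and axis-angle gap."""
    (sx, sy), s_angle = principal_orientation(scribble)
    (ax, ay), a_angle = principal_orientation(attention)
    centroid_dist = float(np.hypot(sx - ax, sy - ay))
    diff = abs(s_angle - a_angle)
    angle_diff = float(min(diff, np.pi - diff))  # axes are equal modulo pi
    return centroid_dist, angle_diff


# Usage: a horizontal stroke versus a diagonal stroke.
horizontal = np.zeros((5, 7))
horizontal[2, 1:6] = 1.0
diagonal = np.eye(5)
center, theta = principal_orientation(horizontal)
```

In a guidance setting, quantities like these could be turned into a differentiable penalty between a scribble and a cross-attention map; how ScribbleDiff actually defines its moment loss is specified in the paper itself, not here.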

Paper Structure

This paper contains 11 sections, 11 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Comparison of user visual prompts (box, scribble, and mask) in terms of usability, information amount, and directionality. Usability (easy to difficult): Box > Scribble > Mask. Directionality (low to high): Box < Mask < Scribble. (Text Prompt: A painting of a dog riding a flying bicycle, over a big city with a yellowish full moon in the night sky.)
  • Figure 2: The overall architecture. Training-free Scribble-Guided Diffusion (ScribbleDiff) consists of two main components: moment alignment and scribble propagation. The red arrows represent the main orientations of the distributions, and the anchors with high similarity (red rectangles) are gathered based on the scribble's anchors (yellow rectangles). (Text Prompt: The clouds drift high in the sky, casting soft, shifting shadows on the calm river below. A medieval bridge spans the width of the waterway.)
  • Figure 3: Impact of moment loss on object orientation. Moment loss improves alignment between the object’s orientation and the direction of the scribble. Without moment loss, the cat faces opposite to the scribble’s direction.
  • Figure 4: Effect of scribble propagation. With scribble propagation in Stable Diffusion, the scribble expands significantly over successive timesteps, improving object shape and enhancing visual coherence.
  • Figure 5: Qualitative comparison of Text-to-Image generation methods using scribble prompts. ScribbleDiff produces results that better align with the scribble inputs, particularly in orientations and abstract shapes of the objects.
  • ...and 2 more figures
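
The captions for Figures 2 and 4 describe anchors with high similarity being gathered around the scribble's anchors so that the scribble expands over timesteps. A rough sketch of one similarity-based expansion step is given below. It assumes a per-location feature map and uses cosine similarity to the mean scribble feature with a top-k selection; the function name `propagate_scribble`, the `top_k` parameter, and the selection rule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def propagate_scribble(features, scribble, top_k=4):
    """Grow a binary scribble mask by one similarity-based step.

    features: (H, W, C) array of per-location features (assumed input).
    scribble: (H, W) boolean mask of current scribble locations.
    Adds the top_k non-scribble locations whose features are most
    cosine-similar to the mean feature of the scribble locations.
    """
    features = np.asarray(features, dtype=float)
    scribble = np.asarray(scribble, dtype=bool)
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    seed = scribble.reshape(-1)
    if not seed.any():
        return scribble.copy()
    # Prototype: normalized mean feature over the current scribble.
    prototype = flat[seed].mean(axis=0)
    prototype = prototype / (np.linalg.norm(prototype) + 1e-8)
    similarity = flat @ prototype
    similarity[seed] = -np.inf  # never re-select existing scribble pixels
    k = min(top_k, int((~seed).sum()))
    grown = seed.copy()
    if k > 0:
        grown[np.argsort(similarity)[-k:]] = True
    return grown.reshape(h, w)


# Usage: two feature regions; the scribble starts as a single pixel on the left.
feats = np.zeros((4, 4, 2))
feats[:, :2, 0] = 1.0  # left half shares one feature direction
feats[:, 2:, 1] = 1.0  # right half shares another
start = np.zeros((4, 4), dtype=bool)
start[0, 0] = True
grown = propagate_scribble(feats, start, top_k=7)
```

Applying such a step repeatedly across denoising timesteps would let a thin stroke spread into the region whose features resemble it, which matches the qualitative behavior described in the Figure 4 caption.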