DiffBrush: Just Painting the Art by Your Hands

Jiaming Chu, Lei Jin, Tao Wang, Junliang Xing, Jian Zhao

TL;DR

This work tackles the challenge of aligning text-driven, diffusion-based painting with user intent while avoiding retraining. It introduces DiffBrush, a training-free framework that uses three energy-guidance terms ($G_{CL}$ for color, $G_{IS}$ for instance and semantics, and $G_{LR}$ for latent regeneration) to steer diffusion denoising toward user sketches, enabling both generation from scratch and editing of existing content. By operating on latent representations and attention maps, DiffBrush provides intuitive, brush-based control that preserves image harmony across color, semantic, and spatial aspects, and it remains compatible with SD, SDXL, and Flux without additional training. Quantitative and qualitative results show better alignment with rough sketches than baselines such as SDEdit and Self-Guidance, and ablation studies confirm the contribution of each guidance component. Because no extra training is required, the approach keeps training costs down and expands interactive painting capabilities, with practical impact for artists and designers seeking user-friendly, controllable AI-powered image creation workflows.

Abstract

The rapid development of image generation and editing algorithms in recent years has enabled ordinary users to produce realistic images. However, the current AI painting ecosystem predominantly relies on text-driven diffusion models (T2I), which struggle to capture user requirements accurately. Furthermore, achieving compatibility with other modalities incurs substantial training costs. To this end, we introduce DiffBrush, which is compatible with T2I models and allows users to draw and edit images. By manipulating and adapting the internal representation of the diffusion model, DiffBrush guides the generated images to converge towards the user's hand-drawn sketches, meeting the user's specific needs without additional training. DiffBrush achieves control over the color, semantics, and instances of objects in images by continuously guiding the latent and the instance-level attention maps during the denoising process of the diffusion model. In addition, we propose latent regeneration, which refines the randomly sampled noise in the diffusion model to obtain a better image layout. As a result, users only need to roughly draw the mask of an instance on the canvas, in acceptable colors, and DiffBrush naturally generates the corresponding instance at the corresponding location.
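
The idea of guiding the latent during denoising can be illustrated with a toy example. The sketch below is not the paper's $G_{CL}$ (its definition does not appear in this summary). It assumes a hypothetical masked squared-error energy between the latent and a target feature vector, and applies plain gradient steps to a random array standing in for a diffusion latent. The names `masked_color_energy`, `energy_gradient`, `guided_step`, and the `step_size` value are all illustrative assumptions.

```python
import numpy as np

def masked_color_energy(latent, target, mask):
    """Hypothetical energy: squared distance to `target` inside `mask`,
    summed over channels and averaged over the masked pixels."""
    diff = (latent - target[:, None, None]) * mask[None]
    return float((diff ** 2).sum() / max(mask.sum(), 1.0))

def energy_gradient(latent, target, mask):
    """Analytic gradient of masked_color_energy with respect to `latent`."""
    diff = (latent - target[:, None, None]) * mask[None]
    return 2.0 * diff / max(mask.sum(), 1.0)

def guided_step(latent, target, mask, step_size=0.1):
    """One guidance update: move the latent down the energy gradient.
    In a real sampler this would be interleaved with the denoising step,
    which is omitted here."""
    return latent - step_size * energy_gradient(latent, target, mask)

# Toy stand-ins (assumptions): a 4-channel 8x8 "latent", a target feature
# vector for the painted region, and a binary mask for the user's stroke.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
target = np.array([1.0, -0.5, 0.0, 0.5])
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0

before = masked_color_energy(latent, target, mask)
guided = latent.copy()
for _ in range(20):
    guided = guided_step(guided, target, mask)
after = masked_color_energy(guided, target, mask)

print(f"energy before: {before:.4f}, after: {after:.4f}")
```

The energy decreases inside the masked region while pixels outside the mask are left untouched, which mirrors the intent of balancing user conditions against model freedom: only the painted area is pulled toward the user's input.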

Paper Structure

This paper contains 26 sections, 7 equations, 12 figures, 4 tables, and 2 algorithms.

Figures (12)

  • Figure 1: Visualization of attention maps from different transformer layers in the U-Net of SD 1.5. We visualize the cross-attention map of "castle" and the self-attention map of its feature center. Deeper layers such as $Down\_block.2$ or $Up\_block.0$ show clearer instance or semantic directionality.
  • Figure 2: We select a multi-instance image with similar colors but different semantics and encode it into the latent space with the VAE encoder. Since the VAE uses an MSE loss as its reconstruction loss, we first select a pixel with a similar color (a), compute its MSE distance to all other pixel features in the latent space $Z$ (b), and then compute its inner-product similarity (c) and cosine similarity (d).
  • Figure 3: The DiffBrush framework comprises two stages: user painting and image generation. In the user painting stage, the user inputs text, selects instances and attributes as brush semantics, draws on the canvas (with different instances on separate layers), and can also edit based on a reference image. The result, mask, and semantics are packed into a triplet. In the image generation stage, DiffBrush uses color, instance, and starting-point constraints to guide generation, is compatible with multiple models, and employs energy functions to balance user conditions and model freedom when generating the desired image.
  • Figure 4: The effect of the Ins&Sem guidance. It shows the original output under the prompt condition, the user's painting, the output with $G_{CL}$, and the output with both $G_{CL}$ and $G_{IS}$.
  • Figure 5: Visualization of the changes in the attention maps under different guidance.
  • ...and 7 more figures
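
The three measures compared in the Figure 2 caption (MSE distance, inner-product similarity, and cosine similarity between one reference pixel and every other pixel feature in the latent space) can be sketched as follows. This is a minimal illustration under stated assumptions: a random array stands in for the VAE-encoded latent $Z$, and the function name `similarity_maps` and the reference pixel location are hypothetical.

```python
import numpy as np

def similarity_maps(latent, row, col):
    """Compare the feature at (row, col) with every pixel feature.

    latent: array of shape (C, H, W), standing in for a VAE latent.
    Returns three (H, W) maps: MSE distance, inner product, cosine similarity.
    """
    c, h, w = latent.shape
    feats = latent.reshape(c, -1).T          # (H*W, C), one row per pixel
    ref = latent[:, row, col]                # (C,) reference feature

    mse = ((feats - ref) ** 2).mean(axis=1)  # lower means more similar
    inner = feats @ ref                      # unnormalized similarity
    norms = np.linalg.norm(feats, axis=1) * np.linalg.norm(ref)
    cosine = inner / np.maximum(norms, 1e-12)  # guard against zero norms

    return mse.reshape(h, w), inner.reshape(h, w), cosine.reshape(h, w)

# Random stand-in for an encoded image (assumption, not real VAE output).
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 16))
mse_map, inner_map, cos_map = similarity_maps(latent, row=5, col=7)

print(mse_map.shape, inner_map.shape, cos_map.shape)
```

With a real latent, plotting the three maps side by side would show how well each measure separates instances that share a color but differ in semantics, which is the comparison the caption describes.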