DreamWalk: Style Space Exploration using Diffusion Guidance

Michelle Shu, Charles Herrmann, Richard Strong Bowen, Forrester Cole, Ramin Zabih

TL;DR

DreamWalk addresses the challenge of fine-grained control in diffusion-based text-to-image generation by decomposing prompts into base content and style components and applying multiple time- and space-dependent guidance terms within a single diffusion run. The method generalizes prior compositional guidance so that each component can be emphasized independently via guidance scale functions, enabling temporal (layout vs. texture) and spatial (region-specific) control without any network fine-tuning, while remaining compatible with DreamBooth and LoRA personalization. Key contributions include the formalization of multiple guidance terms with scale functions $s_i(t,u,v)$, time-varying control of style via $s_{\text{style}}(t,u,v)$, and spatial masking to paint style onto selected regions. Experiments demonstrate style intensity control, multi-style mixing, and subject personalization across SD1.5 and SDXL, complemented by a user study favoring DreamWalk over baselines. The approach gives designers a practical way to explore style space while preserving composition. The diffusion-guidance basis is the update $x_{t-1} = (x_t - f_\theta(x_t,t)) + \epsilon_t$, with the guided prediction $f_\theta(x_t, t) = f_\theta(x_t, t, \emptyset) + \sum_{i=1}^{k} s_i(t, u, v) [ f_\theta(x_t,t, c_i) - f_\theta(x_t,t, \emptyset) ]$. Because it works with both standard diffusion pipelines and fine-tuned variants, DreamWalk can be adopted for stylization and personalization in existing workflows.
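The guided prediction in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `combine_guidance`, the single-channel list-of-lists image type, and the convention that `u` indexes columns and `v` indexes rows are all assumptions made for illustration, and the toy arrays stand in for real denoising-network outputs.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]                     # H x W grid, one channel for brevity
ScaleFn = Callable[[float, int, int], float]  # s_i(t, u, v)


def combine_guidance(
    uncond: Image,
    conds: Sequence[Image],
    scales: Sequence[ScaleFn],
    t: float,
) -> Image:
    """Return f_uncond + sum_i s_i(t, u, v) * (f_cond_i - f_uncond)."""
    height, width = len(uncond), len(uncond[0])
    out = [row[:] for row in uncond]  # start from the unconditional prediction
    for cond, scale in zip(conds, scales):
        for v in range(height):        # v: row index (assumed convention)
            for u in range(width):     # u: column index (assumed convention)
                out[v][u] += scale(t, u, v) * (cond[v][u] - uncond[v][u])
    return out


# Toy stand-ins for three network passes (hypothetical values).
f_uncond = [[0.0, 0.0], [0.0, 0.0]]
f_base = [[1.0, 1.0], [1.0, 1.0]]
f_style = [[2.0, 2.0], [2.0, 2.0]]


def base_scale(t: float, u: int, v: int) -> float:
    return 5.0  # constant emphasis on the base prompt


def style_scale(t: float, u: int, v: int) -> float:
    return 3.0 if u == 0 else 0.0  # style only in the left column


combined = combine_guidance(
    f_uncond, [f_base, f_style], [base_scale, style_scale], t=0.5
)
```

With a single condition and a constant scale, the same function reduces to ordinary classifier-free guidance, which is one way to see the formula as a generalization.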

Abstract

Text-conditioned diffusion models can generate impressive images, but fall short when it comes to fine-grained control. Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering," constructing special text sentences to control the style or amount of a particular subject present in the output image. Our goal is to provide fine-grained control over the style and substance specified by the prompt, for example to adjust the intensity of styles in different regions of the image (Figure 1). Our approach is to decompose the text prompt into conceptual elements, and apply a separate guidance term for each element in a single diffusion process. We introduce guidance scale functions to control when in the diffusion process and where in the image to intervene. Since the method is based solely on adjusting diffusion guidance, it does not require fine-tuning or manipulating the internal layers of the diffusion model's neural network, and can be used in conjunction with LoRA- or DreamBooth-trained models (Figure 2). Project page: https://mshu1.github.io/dreamwalk.github.io/
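One simple way to picture a guidance scale function that controls both "when" and "where" is a step function in time multiplied by a spatial mask. The sketch below is an assumption for illustration only: `make_scale_fn`, its parameters, and the hard time window are hypothetical, and the schedules actually used in the paper may differ. It follows the time convention of the Figure 5 caption, where $t$ near 1 governs layout and $t$ near 0 governs texture.

```python
from typing import Callable, List, Optional

ScaleFn = Callable[[float, int, int], float]  # s(t, u, v)


def make_scale_fn(
    strength: float,
    t_start: float = 1.0,
    t_end: float = 0.0,
    mask: Optional[List[List[float]]] = None,
) -> ScaleFn:
    """Build s(t, u, v) = strength * mask[v][u] for t_end <= t <= t_start, else 0.

    With mask=None the function is spatially uniform.
    """

    def scale(t: float, u: int, v: int) -> float:
        if not (t_end <= t <= t_start):
            return 0.0  # outside the active part of the diffusion process
        weight = 1.0 if mask is None else mask[v][u]
        return strength * weight

    return scale


# Apply a style only late in denoising (texture phase) and only in the
# left column of a 2 x 2 image, leaving the early layout steps untouched.
left_mask = [[1.0, 0.0], [1.0, 0.0]]
late_style = make_scale_fn(4.0, t_start=0.5, mask=left_mask)

early = late_style(0.9, 0, 0)       # before the window opens
inside = late_style(0.3, 0, 0)      # in the window, inside the mask
outside = late_style(0.3, 1, 0)     # in the window, outside the mask
```

A soft (fractional) mask or a smooth ramp in $t$ would fit the same interface; the hard window is used here only because it is the easiest case to check by hand.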

Paper Structure (21 sections, 5 equations, 13 figures)

Figures (13)

  • Figure 1: DreamWalk allows fine-grained control of style in text-to-image generation. We start with a base generated image (left), using the prompt "A river flows under a bridge with clear sky". We explore style space by independently increasing different styles at different locations. The middle row at center shows three images generated by increasing the pixel art style (using the prompt "Pixel art style train"), applied to the orange mask shown in the (+) column. The image at right shows the result of increasing all three styles in their mask-specified locations. All images shown are generated directly from diffusion using our guidance scale functions, and do not rely on image compositing or other post-processing. Generated with SDXL.
  • Figure 2: Controllable subject / prompt emphasis. Our formulation can explore adherence to a DreamBooth subject or adherence to the text prompt. Generated with SD1.5.
  • Figure 3: Applying a style without specifying a subject can, depending on the biases of the prompt's distribution, lead to changes in the subjects in the generated image. For example, the "Hokusai style" distribution has a strong bias towards showing Hokusai's famous waves, even when they overwhelm the original content of the image. Adding a subject to the prompt, like "house", can mitigate this. Generated with SDXL.
  • Figure 4: Controlling the emphasis. Left: the standard way of using text-to-image models is to specify everything in a single prompt and then guide towards it with the guidance term $g_\theta$ multiplied by scale $s$. However, this does not provide a natural way to emphasize individual parts of a single prompt. We note that prompts can be naturally decomposed; the simplest decomposition is a base prompt (the objects, verbs, and setting; "a river flows under a bridge with clear sky") plus any style application ("van Gogh"). Right: each decomposed prompt receives its own guidance term and scale function. By varying the scale function, we can walk towards a linear combination of terms, as denoted by the transparent arrows. This allows us to aim for any of the distributions represented by the circles. This example depicts style interpolation with a layout from a base prompt. Note that the unconditional distribution is not shown.
  • Figure 5: Examining the norm of the denoising predictions at different time-steps $t$ suggests that the image is formed coarse-to-fine. Near $t=1$, the network edits the global image layout (low frequencies), whereas near $t=0$ the network seems to focus on texture (high frequencies). Note: the top row shows the latent at different time-steps $t$ decoded through the VAE; the bottom row shows the norm of the denoising prediction for the text-conditioned pass of the denoising network.
  • ...and 8 more figures