DreamWalk: Style Space Exploration using Diffusion Guidance
Michelle Shu, Charles Herrmann, Richard Strong Bowen, Forrester Cole, Ramin Zabih
TL;DR
DreamWalk addresses the lack of fine-grained control in diffusion-based text-to-image generation. It decomposes a prompt into base-content and style components and applies multiple time- and space-dependent guidance terms within a single diffusion run. The method generalizes prior compositional guidance: each component $c_i$ receives its own guidance scale function $s_i(t,u,v)$, evaluated at timestep $t$ and image location $(u,v)$, which enables temporal (layout vs. texture) and spatial (region-specific) control without any network fine-tuning, and it remains compatible with DreamBooth and LoRA personalization. Key contributions are the formalization of multiple guidance terms with $s_i(t,u,v)$, time-varying control of style via $s_{\text{style}}(t,u,v)$, and spatial masking to paint style onto selected regions. Experiments demonstrate style intensity control, multi-style mixing, and subject personalization on SD1.5 and SDXL, complemented by a user study favoring DreamWalk over baselines. The underlying diffusion-guidance formulation is $x_{t-1} = (x_t - f_\theta(x_t,t)) + \epsilon_t$ with $f_\theta(x_t,t) = f_\theta(x_t,t,\emptyset) + \sum_{i=1}^{k} s_i(t,u,v)\,[f_\theta(x_t,t,c_i) - f_\theta(x_t,t,\emptyset)]$, where $\emptyset$ denotes the empty (unconditional) prompt. Because it works with both standard diffusion pipelines and fine-tuned variants, the approach gives designers a practical way to explore style space while preserving composition.
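The guidance sum in the summary above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `compose_guidance`, the array shapes, and the toy values are assumptions chosen for illustration, and the model outputs $f_\theta(\cdot)$ are replaced by constant arrays so the arithmetic is easy to follow.

```python
import numpy as np

def compose_guidance(eps_uncond, eps_conds, scales):
    """Combine one unconditional prediction with k conditional ones.

    eps_uncond: array (C, H, W), stands in for f(x_t, t, empty prompt)
    eps_conds:  list of k arrays (C, H, W), stand in for f(x_t, t, c_i)
    scales:     list of k scalars or arrays broadcastable to (C, H, W),
                i.e. s_i(t, u, v) already evaluated at the current t
    """
    base = np.asarray(eps_uncond, dtype=float)
    out = base.copy()
    for eps_c, s in zip(eps_conds, scales):
        # Each term pushes the prediction toward its own condition.
        out = out + s * (np.asarray(eps_c, dtype=float) - base)
    return out

# Toy stand-ins for the network outputs (hypothetical values).
eps_uncond = np.zeros((3, 4, 4))
eps_content = np.ones((3, 4, 4))
eps_style = 2.0 * np.ones((3, 4, 4))

# Uniform scales: the same emphasis everywhere in the image.
uniform = compose_guidance(eps_uncond, [eps_content, eps_style], [2.0, 0.5])

# Spatially varying style scale: style applied to the left half only.
style_mask = np.zeros((1, 4, 4))
style_mask[:, :, :2] = 1.0
regional = compose_guidance(
    eps_uncond, [eps_content, eps_style], [2.0, 0.5 * style_mask]
)
```

With a single condition and a scalar scale, this reduces to the familiar classifier-free guidance update; passing an array as a scale is what makes the emphasis region-specific.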
Abstract
Text-conditioned diffusion models can generate impressive images, but fall short when it comes to fine-grained control. Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering," constructing special text sentences to control the style or amount of a particular subject present in the output image. Our goal is to provide fine-grained control over the style and substance specified by the prompt, for example to adjust the intensity of styles in different regions of the image (Figure 1). Our approach is to decompose the text prompt into conceptual elements, and apply a separate guidance term for each element in a single diffusion process. We introduce guidance scale functions to control when in the diffusion process and where in the image to intervene. Since the method is based solely on adjusting diffusion guidance, it does not require fine-tuning or manipulating the internal layers of the diffusion model's neural network, and can be used in conjunction with LoRA- or DreamBooth-trained models (Figure 2). Project page: https://mshu1.github.io/dreamwalk.github.io/
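As a rough illustration of a guidance scale function that controls both when and where to intervene, the sketch below builds a per-pixel style scale from a step schedule and a binary mask. Everything here is hypothetical: the name `style_scale`, the `progress` convention (0 at the first, noisiest denoising step and 1 at the last), the threshold `start`, and the value `s_max = 7.5` are assumptions for illustration, not settings taken from the paper.

```python
import numpy as np

def style_scale(progress, mask, s_max=7.5, start=0.5):
    """Hypothetical s_style(t, u, v) on an (H, W) grid.

    progress: fraction of denoising completed, in [0, 1] (assumed convention)
    mask:     (H, W) array in [0, 1]; 1 where style should be applied
    s_max:    style guidance strength once style is switched on (assumed)
    start:    progress at which style guidance begins (assumed)
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    # Temporal part: leave early steps (layout) alone, style later steps.
    temporal = s_max if progress >= start else 0.0
    # Spatial part: restrict the style to the masked region.
    return temporal * np.asarray(mask, dtype=float)

# Style only the left half of a 4x4 image grid.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0

early = style_scale(0.2, mask)  # before `start`: no style anywhere
late = style_scale(0.8, mask)   # after `start`: style on the masked half
```

A step function is the simplest possible schedule; a smooth ramp or a soft-edged mask would fit the same interface, since the result is just an array multiplied into the style guidance term.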
