RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control

Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu

TL;DR

RB-Modulation presents a training-free personalization framework for diffusion models by casting reverse diffusion as a stochastic optimal control problem with a terminal style cost, enabling precise control over style and content without adapters. The method combines a stochastic optimal controller (SOC) with an Attention Feature Aggregation (AFA) module to decouple content and style within cross-attention, and it provides practical algorithms for both small and large-scale models. Theoretical links between optimal control and reverse diffusion justify the terminal-cost approach, while Tweedie-based conditioning makes the controller causal for generative modeling. Empirically, RB-Modulation outperforms state-of-the-art training-free baselines in stylization and content-style composition, with strong human-preference signals and robust prompt alignment across datasets.

Abstract

We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches exhibit difficulties in (a) style extraction from reference images in the absence of additional style or content text descriptions, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller where a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes the difficulties above, but also ensures high fidelity to the reference style and adheres to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. With theoretical justification and empirical evidence, our framework demonstrates precise extraction and control of content and style in a training-free manner. Further, our method allows a seamless composition of content and style, which marks a departure from the dependency on external adapters or ControlNets.

Paper Structure

This paper contains 23 sections, 6 theorems, 38 equations, 14 figures, 3 tables, 2 algorithms.

Key Result

Proposition 5.1

Let $A\in \mathbb{R}^{k\times d}$ be a linear style extractor that operates on the terminal state $X^u_1 \in \mathbb{R}^d$. Given reference style features $y_1$, consider the control problem: … Then, in the limit $\gamma \rightarrow \infty$, the optimal controller is $u^* = \frac{\left(A^T A\right)^{-1} A^T\left(y_1 - A\mathbf{x}_t \right)}{1-t}$, which yields the following controlled dynamics: …
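The closed-form controller in Proposition 5.1 can be tried on a toy problem. The sketch below is illustrative only and rests on assumptions that the excerpt above does not state: the controlled state is taken to follow $dX_t = u^*\,dt + dW_t$ on $t \in [0, 1]$, it is discretized with Euler-Maruyama, and $A$ has full column rank so that $A^T A$ is invertible. The function names, the toy matrix, and the target are hypothetical.

```python
import numpy as np


def optimal_control(A, y1, x, t):
    """u* = (A^T A)^{-1} A^T (y1 - A x) / (1 - t), as stated in Proposition 5.1."""
    # Solve the normal equations instead of forming the inverse explicitly.
    return np.linalg.solve(A.T @ A, A.T @ (y1 - A @ x)) / (1.0 - t)


def simulate(A, y1, x0, n_steps=1000, seed=0):
    """Euler-Maruyama on [0, 1] for the ASSUMED dynamics dX = u* dt + dW."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt  # t stays strictly below 1, so 1 - t never vanishes
        noise = rng.standard_normal(x.shape)
        x = x + optimal_control(A, y1, x, t) * dt + np.sqrt(dt) * noise
    return x


# Toy setup (hypothetical): k = 3 style features, d = 2 state dimensions.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # full column rank
target = np.array([2.0, -1.0])
y1 = A @ target  # reference features that are exactly reachable
x1 = simulate(A, y1, x0=np.zeros(2))
residual = np.linalg.norm(A @ x1 - y1)
print(residual)  # small: the terminal features land close to y1
```

Under these assumptions the drift grows as $t \to 1$ and pulls $A X_t$ toward $y_1$, so the terminal residual is limited mainly by the noise injected in the final step. With $A$ equal to the identity, the drift reduces to $(y_1 - \mathbf{x}_t)/(1-t)$, which is the drift of a Brownian bridge pinned at $y_1$.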

Figures (14)

  • Figure 1: Given a single reference image (rounded rectangle), our method RB-Modulation offers a plug-and-play solution for (a) stylization, and (b) content-style composition with various prompts while maintaining sample diversity and prompt alignment. For instance, given a reference style image (e.g., "melting golden 3D rendering style") and content image (e.g., (A) "dog"), our method adheres to the desired prompts without leaking contents from the reference style image and without being restricted to the pose of the reference content image.
  • Figure 2: Reference-Based Modulation
  • Figure 3: Qualitative results for stylization: A comparison with state-of-the-art methods (InstantStyle, StyleAligned, StyleDrop) highlights our advantages in preventing information leakage from the reference style and adhering more closely to desired prompts.
  • Figure 4: Ablation study: Our method builds on any transformer-based diffusion model. In this case, we use StableCascade as the foundation, and sequentially add each module to show its effectiveness. DirectConcat involves concatenating reference image embeddings with prompt embeddings. Style descriptions are excluded in this ablation study.
  • Figure 5: Qualitative results for content-style composition: Our method shows better prompt alignment and greater diversity than the training-free methods IP-Adapter and InstantStyle, and has competitive performance with the training-based ZipLoRA.
  • ...and 9 more figures

Theorems & Definitions (9)

  • Proposition 5.1
  • Proposition 5.2
  • Theorem A.1: HJB Equation
  • Proposition A.2: Linear optimal control with quadratic cost
  • Remark A.3: Connections between diffusion-based generative modeling and stochastic optimal control
  • Proposition A.4
  • proof
  • Proposition A.5
  • proof