Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

TL;DR

Ctrl-X is a training-free, guidance-free framework for controlling structure and appearance in text-to-image and text-to-video diffusion models. It relies on two mechanisms: (1) feed-forward structure control, which injects convolution and self-attention features computed from a structure image into the generation process, and (2) spatially-aware appearance transfer, which uses self-attention-based semantic correspondence to transfer appearance statistics from an appearance image. The method requires no training and no backpropagation at inference time, so it plugs directly into different model architectures and accepts novel condition inputs. It shows superior appearance alignment and competitive structure preservation relative to baselines. Empirical results, ablations, and a user study support these claims, and the paper also covers extensions to prompt-driven generation and T2V diffusion, along with limitations and safety considerations for real-world use.

Abstract

Recent controllable generation approaches such as FreeControl and Diffusion Self-Guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x
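
The appearance-transfer idea can be illustrated with a small, hedged sketch: each output position attends over the appearance features, and the attention weights yield a per-position weighted mean and standard deviation that are then applied to the normalized output features. Everything below is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, queries and keys are taken to be the channel-normalized features themselves (no learned projections), and the statistics are applied AdaIN-style as `S * normalize(f_out) + M`.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(feats, eps=1e-5):
    """Standardize each channel across positions; feats is N rows of C floats."""
    n = len(feats)
    columns = list(zip(*feats))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return [[(v - m) / (s + eps) for v, m, s in zip(row, means, stds)]
            for row in feats]

def appearance_transfer(f_out, f_app):
    """Return (new_f_out, M, S) given output and appearance features.

    Assumption: queries/keys are the normalized features (no projections).
    """
    q, k = normalize(f_out), normalize(f_app)
    scale = math.sqrt(len(q[0]))
    channels = list(zip(*f_app))  # appearance features grouped per channel
    new_f_out, M, S = [], [], []
    for q_row in q:
        # Correspondence of this output position to every appearance position.
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        # Attention-weighted first and second moments of the appearance features.
        mean = [sum(w * v for w, v in zip(weights, col)) for col in channels]
        second = [sum(w * v * v for w, v in zip(weights, col)) for col in channels]
        std = [math.sqrt(max(s2 - m * m, 0.0)) for s2, m in zip(second, mean)]
        M.append(mean)
        S.append(std)
        # Re-scale and re-shift the normalized output features.
        new_f_out.append([s * x + m for s, x, m in zip(std, q_row, mean)])
    return new_f_out, M, S

f_out = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]   # 3 output positions, 2 channels
f_app = [[5.0, -1.0], [5.0, -1.0]]             # constant appearance features
new_f_out, M, S = appearance_transfer(f_out, f_app)
```

In this toy call the appearance features are identical at every position, so the weighted standard deviation vanishes and every output position simply takes on the appearance feature values. With non-constant appearance features, each output position would instead receive statistics dominated by the appearance positions it corresponds to most strongly.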

Paper Structure (15 sections, 9 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Guidance-free structure and appearance control of Stable Diffusion XL (SDXL) (Podell et al., 2023). Ctrl-X enables training-free and guidance-free zero-shot control of pretrained text-to-image diffusion models given any structure conditions and appearance images.
  • Figure 2: Visualizing early diffusion features. Using $20$ real, generated, and condition images of animals, we extract Stable Diffusion XL (Podell et al., 2023) features right after the decoder layer $0$ convolution. We visualize the top three principal components computed for each time step across all images. $t = 961$ to $881$ correspond to inference steps $1$ to $5$ of the DDIM scheduler with $50$ time steps. We obtain $\mathbf{x}_t$ by directly adding Gaussian noise to each clean image $\mathbf{x}_0$ via the diffusion forward process.
  • Figure 3: Overview of Ctrl-X. (a) At each sampling step $t$, we obtain $\mathbf{x}^\mathrm{s}_t$ and $\mathbf{x}^\mathrm{a}_t$ via the forward diffusion process, then feed them into the T2I diffusion model to obtain their convolution and self-attention features. Then, we inject convolution and self-attention features from $\mathbf{x}^\mathrm{s}_t$ and leverage self-attention correspondence to transfer spatially-aware appearance statistics from $\mathbf{x}^\mathrm{a}_t$ to $\mathbf{x}^\mathrm{o}_t$. (b) Details of our spatially-aware appearance transfer, where we exploit self-attention correspondence between $\mathbf{x}^\mathrm{o}_t$ and $\mathbf{x}^\mathrm{a}_t$ to compute weighted feature statistics $\mathbf{M}$ and $\mathbf{S}$ applied to $\mathbf{x}^\mathrm{o}_t$.
  • Figure 4: Qualitative results for T2I diffusion structure and appearance control and conditional generation. Ctrl-X supports a diverse variety of structure images for both (a) structure and appearance controllable generation and (b) prompt-driven conditional generation.
  • Figure 5: Qualitative comparison of structure and appearance control. Ctrl-X displays comparable structure control and superior appearance transfer compared to training-based methods. It is also more robust than guidance-based and guidance-free methods across diverse structure types.
  • ...and 12 more figures
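
The Figure 2 caption's numbers can be checked with a short sketch. It reads $t = 961$ to $881$ over five inference steps as evenly spaced time steps, and noises a clean input with the standard forward process $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\,\boldsymbol{\epsilon}$. Several things here are assumptions rather than facts from the caption: the $1000$ training time steps, the scaled-linear beta schedule and its endpoint values, and the helper names `alphas_cumprod` and `add_noise`. The even spacing is inferred from the caption's endpoints, not from a stated scheduler configuration.

```python
import math
import random

NUM_TRAIN_STEPS = 1000      # assumption: not stated in the caption
NUM_INFERENCE_STEPS = 50    # from the caption

# Evenly spaced DDIM time steps: 1000 / 50 gives a stride of 20.
stride = NUM_TRAIN_STEPS // NUM_INFERENCE_STEPS
timesteps = [961 - stride * i for i in range(5)]   # inference steps 1 to 5

def alphas_cumprod(num_steps=NUM_TRAIN_STEPS, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for an assumed scaled-linear schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    abar, running = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        abar.append(running)
    return abar

def add_noise(x0, t, abar, rng):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    signal, noise = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
abar = alphas_cumprod()
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]   # stand-in for a flattened clean image
noisy = {t: add_noise(x0, t, abar, rng) for t in timesteps}
```

Under these assumptions the five time steps are 961, 941, 921, 901, and 881, matching the caption's endpoints, and the signal weight $\sqrt{\bar\alpha_t}$ is smallest at $t = 961$, so the features visualized at the earliest steps come from inputs that are almost entirely noise.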