RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang

TL;DR

This work tackles zero-shot spatial control in text-to-image diffusion. It shows that existing training-free feature-injection methods neglect the evolving interplay between structure preservation and domain alignment across diffusion steps. It introduces a training-free framework with three components, Structure-Rich Injection (SRI), Restart Refinement (RR), and Appearance-Rich Prompting (ARP), and decouples the sampling schedule of condition features from the denoising process, showing that mid-range timesteps and a constant injection schedule yield robust results. The approach achieves state-of-the-art performance among training-free methods and can also enhance other pipelines such as FreeControl, improving structure fidelity and visual quality while maintaining inference efficiency. Overall, the paper provides a principled analysis of condition-feature schedules and demonstrates practical gains in structure- and appearance-rich controllable generation without additional training.

Abstract

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., Canny edges) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules for higher-quality structure guidance in the feature space. Specifically, we find that condition features sampled from a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.
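
To make the idea of a decoupled schedule concrete, the following is a minimal sketch, not the authors' implementation. It treats the schedule as a function g that maps the current denoising timestep t to the timestep at which condition features are sampled. The helper names, the five-step sampler, and the fixed value 500 (chosen only because the TL;DR mentions mid-range timesteps) are all illustrative assumptions.

```python
from typing import Callable, List


def denoising_timesteps(num_steps: int, t_max: int = 1000) -> List[int]:
    """Evenly spaced timesteps from t_max downwards (hypothetical helper)."""
    stride = t_max // num_steps
    return [t_max - i * stride for i in range(num_steps)]


def coupled_schedule(t: int) -> int:
    """Coupled baseline: condition features share the denoising timestep."""
    return t


def make_constant_schedule(t_fixed: int) -> Callable[[int], int]:
    """Decoupled schedule that always returns one fixed timestep.

    t_fixed is an assumed, illustrative value, not one taken from the paper.
    """
    def g(t: int) -> int:
        return t_fixed
    return g


def condition_timesteps(steps: List[int], g: Callable[[int], int]) -> List[int]:
    """Timestep used for condition features at each denoising step."""
    return [g(t) for t in steps]


steps = denoising_timesteps(num_steps=5)
coupled = condition_timesteps(steps, coupled_schedule)
constant = condition_timesteps(steps, make_constant_schedule(500))

print(steps)     # [1000, 800, 600, 400, 200]
print(coupled)   # [1000, 800, 600, 400, 200]
print(constant)  # [500, 500, 500, 500, 500]
```

In this sketch the constant schedule touches only one distinct condition timestep, so condition features could in principle be extracted once and reused at every denoising step, which is one way to read the abstract's description of the schedule as simple and efficient.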

Paper Structure

This paper contains 51 sections, 14 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Structure- and appearance-rich training-free spatial control for text-to-image generation. We propose a training-free framework that enables high-quality spatial control for pretrained text-to-image diffusion models under arbitrary spatial conditions. (Left) By introducing strong and intuitive structure and appearance control, our method effectively addresses key limitations of prior work such as Ctrl-X (Lin et al., 2024), including structure misalignment, condition leakage, and artifacts, and (Right) achieves SOTA performance among all training-free methods; in the radar chart, greater distance from the center indicates superior results. Project page: https://zhang-liheng.github.io/richcontrol/.
  • Figure 2: The evolving curves of KL divergence and L2 distance of self-similarity matrices across diffusion timesteps. The denoising process starts from timestep 1000 (right).
  • Figure 3: Visualizing diffusion features extracted from the condition and natural images at different timesteps. We display the first principal component computed for each time step across all images.
  • Figure 4: Method overview. Our framework consists of three key components. (i) The Structure-Rich Injection (SRI) module injects structure-rich features $\mathbf{f}_{l,g(t)}^\text{struct}$ and attentions $\mathbf{A}_{l,g(t)}^\text{struct}$ of the condition image into the output image feature space to enable spatial control (Sec. \ref{sec:method_feature}). (ii) The Restart Refinement (RR) module refines the image $\mathbf{I}$ by iteratively adding noise and denoising the output, leading to more realistic visual details such as the eyes of the bear (Sec. \ref{sec:method_restart}). (iii) The Appearance-Rich Prompting (ARP) module enriches the prompt $\mathcal{P}$ based on the semantics of the condition image $\mathbf{I}^\text{struct}$, producing an appearance-rich prompt $\mathcal{P}^\text{app}$ and thereby providing a semantically-aligned appearance image $\mathbf{I}^\text{app}$ (Sec. \ref{sec:method_prompt}).
  • Figure 5: Qualitative comparison with existing methods. Our method effectively addresses common failure modes observed in previous methods: structure misalignment, condition leakage, and visual artifacts, generating high-quality images that adhere closely to the prompts with strong spatial alignment.
  • ...and 12 more figures
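
The Figure 2 caption refers to an L2 distance between self-similarity matrices. The sketch below shows one plausible way to compute such a quantity; it is an illustration under stated assumptions and not the paper's exact metric. It assumes cosine similarity between per-location feature vectors and a Frobenius-style L2 distance, and the function names and toy feature values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity(features: Sequence[Vector]) -> Matrix:
    """Pairwise cosine similarity between all spatial feature vectors.

    Cosine similarity is an assumption; the paper may define it differently.
    """
    return [[cosine(u, v) for v in features] for u in features]


def l2_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius-style L2 distance between two equally sized matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


# Toy features: one vector per spatial location (made-up values).
reference = [[1.0, 0.0], [0.0, 1.0]]
rescaled = [[2.0, 0.0], [0.0, 3.0]]    # same directions, different magnitudes
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # both locations look identical

s_ref = self_similarity(reference)
d_same = l2_distance(s_ref, self_similarity(rescaled))
d_diff = l2_distance(s_ref, self_similarity(collapsed))
```

Because cosine similarity ignores feature magnitude, the rescaled features give the same self-similarity matrix as the reference (distance 0), whereas the collapsed features do not. This makes such a matrix a natural descriptor of spatial structure rather than appearance.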