RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang
TL;DR
This work tackles zero-shot spatial control in text-to-image diffusion. It shows that existing training-free feature injection methods neglect the evolving interplay between structure preservation and domain alignment across diffusion steps. It introduces a training-free framework with three components, Structure-Rich Injection (SRI), Restart Refinement (RR), and Appearance-Rich Prompting (ARP), and decouples the sampling schedule of condition features from the denoising process, showing that mid-range timesteps and a constant injection schedule yield robust results. The approach achieves state-of-the-art performance among training-free methods and can also enhance other pipelines such as FreeControl, improving structure fidelity and visual quality while maintaining inference efficiency. Overall, the paper provides a principled analysis of condition-feature schedules and demonstrates practical gains in structure- and appearance-rich controllable generation without additional training.
Abstract
Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., Canny edge maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an empirical analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules for higher-quality structure guidance in the feature space. Specifically, we find that condition features sampled from a single timestep are sufficient, yielding a simple yet efficient schedule that balances structure alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art results across diverse zero-shot conditioning scenarios.
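To make the idea of decoupling the condition-feature schedule from the denoising process concrete, here is a minimal Python sketch. It only illustrates which timestep the condition features would be drawn from at each denoising step, and it is not the authors' implementation. The function names (`coupled_schedule`, `constant_schedule`, `condition_timesteps`), the five-step denoising trajectory, and the fixed timestep 500 are all hypothetical choices made for illustration.

```python
from typing import Callable, List

# A schedule maps the current denoising timestep to the timestep at which
# condition features are sampled. All names here are hypothetical.
Schedule = Callable[[int], int]


def coupled_schedule() -> Schedule:
    """Condition features are taken at the same timestep as the denoising step."""
    return lambda t: t


def constant_schedule(t_cond: int) -> Schedule:
    """Condition features are always taken from one fixed timestep."""
    return lambda t: t_cond


def condition_timesteps(denoise_steps: List[int], schedule: Schedule) -> List[int]:
    """Return the condition-feature timestep used at each denoising step."""
    return [schedule(t) for t in denoise_steps]


# Illustrative denoising trajectory (assumed, not taken from the paper).
denoise_steps = [999, 749, 499, 249, 0]

coupled = condition_timesteps(denoise_steps, coupled_schedule())
# 500 is an assumed mid-range value, chosen only for illustration.
constant = condition_timesteps(denoise_steps, constant_schedule(500))

print("coupled: ", coupled)
print("constant:", constant)
```

With the coupled schedule, the condition timestep tracks the denoising timestep. With the constant schedule, every denoising step reuses features from the same timestep, so in a real pipeline those features would need to be extracted only once.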
