FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou

TL;DR

FreeControl tackles training-free, fine-grained spatial control of text-to-image diffusion models. It applies PCA to the diffusion features of seed images to obtain a semantic subspace, then applies structure and appearance guidance in that subspace. The method achieves zero-shot control across multiple architectures and checkpoints and supports a wide range of input modalities for spatial conditions, including complex objects and graphics primitives. In extensive experiments, FreeControl outperforms existing training-free baselines in structure preservation and image-text alignment, while approaching the quality of training-based approaches. Because no per-condition retraining is needed, it enables scalable, flexible design workflows for generative visual content.

Abstract

Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl designs structure guidance to facilitate the structure alignment with a guidance image, and appearance guidance to enable the appearance sharing between images generated using the same seed. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl facilitates convenient training-free control over many different architectures and checkpoints, allows the challenging input conditions on which most of the existing training-free methods fail, and achieves competitive synthesis quality with training-based approaches.

Paper Structure (16 sections, 8 equations, 19 figures, 1 table)

This paper contains 16 sections, 8 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Training-free conditional control of Stable Diffusion. (a) FreeControl enables zero-shot control of pretrained text-to-image diffusion models given an input condition image in any modality. (b) Compared to ControlNet [zhang2023controlNet], FreeControl achieves a good balance between spatial and image-text alignment when facing a conflict between the guidance image and text description. Further, it supports condition types (e.g., 2D projections of point clouds and meshes in the bottom row) for which constructing training pairs is difficult.
  • Figure 2: Visualization of feature subspace given by PCA. Keys from the first self-attention in the U-Net decoder are obtained via DDIM inversion [song2020ddim] for five images in different styles and modalities (top: person; bottom: bedroom), and subsequently undergo PCA. The top three principal components (pseudo-colored in RGB) provide a clear separation of semantic components.
  • Figure 3: Method overview. (a) In the analysis stage, FreeControl generates seed images for a target concept (e.g., man) using a pretrained diffusion model and performs PCA on their diffusion features to obtain a linear subspace as semantic basis. (b) In the synthesis stage, FreeControl employs structure guidance in this subspace to enforce structure alignment with the input condition. In the meantime, it applies appearance guidance to facilitate appearance transfer from a sibling image generated using the same seed without structure control.
  • Figure 4: Qualitative comparison of controllable T2I diffusion. FreeControl supports a suite of control signals and three major versions of Stable Diffusion. The generated images closely follow the text prompts while exhibiting strong spatial alignment with the input images.
  • Figure 5: Qualitative results for more control conditions. FreeControl supports challenging control conditions not possible with training-based methods. These include 2D projections of common graphics primitives (rows 1 and 2), domain-specific shape models (rows 3 and 4), graphics software viewports (row 5), and simulated driving environments (row 6).
  • ...and 14 more figures
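
The PCA step described in the captions of Figures 2 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`semantic_basis`, `project`, `pseudo_color`), the feature shape, and the random placeholder features are all assumptions made for illustration. In the paper the features are self-attention keys from the U-Net decoder, which this sketch does not extract.

```python
import numpy as np


def semantic_basis(features, n_components=3):
    """Fit a PCA basis on stacked feature maps.

    features: array of shape (num_images, H, W, C).
    Returns (mean, basis), where basis has shape (n_components, C).
    """
    flat = features.reshape(-1, features.shape[-1])
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(feature_map, mean, basis):
    """Project one (H, W, C) feature map onto the basis: (H, W, n_components)."""
    return (feature_map - mean) @ basis.T


def pseudo_color(coords):
    """Rescale each component to [0, 1] so three components can be viewed as RGB."""
    lo = coords.min(axis=(0, 1), keepdims=True)
    hi = coords.max(axis=(0, 1), keepdims=True)
    return (coords - lo) / (hi - lo + 1e-8)


# Placeholder data standing in for the diffusion features of five seed images
# (hypothetical shape: 16x16 spatial grid, 64 channels).
rng = np.random.default_rng(0)
seed_features = rng.normal(size=(5, 16, 16, 64))

mean, basis = semantic_basis(seed_features, n_components=3)
rgb = pseudo_color(project(seed_features[0], mean, basis))
```

With real features, `rgb` would correspond to the pseudo-colored visualization in Figure 2, and the same `project` call would give the subspace coordinates in which the structure guidance of Figure 3 operates. How that guidance is computed is not specified in this section, so it is left out of the sketch.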