FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi

TL;DR

FreeControl addresses structural and semantic control of diffusion-generated images without retraining. It is a training-free method that extracts the self-attention queries $Q_{t^*}^{(l)}$ in a single step at a key timestep $t^*=661$ and injects them throughout denoising, using a layer-aware injection scheme. Latent-Condition Decoupling (LCD) decouples the noised latent from the conditioning timestep, enabling more stable and finer-grained control, while compositional reference images support intuitive scene layout. The approach delivers strong structural fidelity at modest computational overhead (approximately 5 percent additional cost, per the abstract) and is compatible with fine-tuned or LoRA-augmented models, making it a practical test-time solution for structure-aware generation directly from raw images in design and imaging tasks.

Abstract

Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen key timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the key timestep and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control: structurally and semantically aligned, visually coherent generation directly from raw images, with support for intuitive compositional design and compatibility with modern diffusion models, at approximately 5 percent additional cost.
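
The control flow described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The toy denoiser, the scalar-token attention, the linear beta schedule, the choice of injected layers, the placeholder update rule, and every function name are assumptions made for illustration. Only the key timestep 661 (from the TL;DR), the idea of extracting queries once and reusing them, and the separation of the latent's noise level from the conditioning timestep come from the text.

```python
import math
import random

NUM_TRAIN_STEPS = 1000   # assumption: DDPM-style training schedule length
KEY_TIMESTEP = 661       # t* as quoted in the TL;DR
NUM_LAYERS = 4           # toy network depth (hypothetical)
INJECT_LAYERS = {2, 3}   # hypothetical choice of "later" layers


def alpha_bar(t):
    """Cumulative product of (1 - beta_s) for an assumed linear beta schedule."""
    prod = 1.0
    for s in range(t + 1):
        prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * s / (NUM_TRAIN_STEPS - 1))
    return prod


def forward_noise(x0, t, rng):
    """Standard forward diffusion: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    return [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for v in x0]


def self_attention(h, q):
    """Single-head attention over scalar tokens; keys and values are h itself."""
    out = []
    for qi in q:
        scores = [qi * kj for kj in h]
        top = max(scores)                       # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, h)) / total)
    return out


def toy_denoiser(latent, t_cond, injected=None):
    """Toy stand-in for a denoiser. Returns (output tokens, per-layer queries).

    `injected` maps a layer index to reference queries that replace that
    layer's own queries.
    """
    h = [v + t_cond / NUM_TRAIN_STEPS for v in latent]   # toy timestep conditioning
    queries = {}
    for layer in range(NUM_LAYERS):
        q = [(layer + 1) * 0.5 * v for v in h]           # toy query projection
        if injected is not None and layer in injected:
            q = injected[layer]                          # swap in reference queries
        queries[layer] = q
        h = self_attention(h, q)
    return h, queries


def extract_reference_queries(ref_latent, t_latent, t_cond, rng):
    """One forward pass on the reference: noise it to level t_latent, but
    condition the network on t_cond. Keeping the two separate mirrors the
    decoupling idea; how the paper picks t_latent is not stated here."""
    noised = forward_noise(ref_latent, t_latent, rng)
    _, queries = toy_denoiser(noised, t_cond)
    return {layer: q for layer, q in queries.items() if layer in INJECT_LAYERS}


def generate(ref_queries, num_steps, rng, n_tokens):
    """Denoising loop that reuses the same reference queries at every step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]
    passes = 0
    for i in range(num_steps):
        t = round((NUM_TRAIN_STEPS - 1) * (1 - i / num_steps))
        pred, _ = toy_denoiser(x, t, injected=ref_queries)
        passes += 1
        # Placeholder update, not a real sampler.
        x = [0.9 * xi + 0.1 * pi for xi, pi in zip(x, pred)]
    return x, passes


if __name__ == "__main__":
    rng = random.Random(0)
    reference = [0.2, -0.5, 1.0, 0.3]            # stand-in for an encoded image
    ref_q = extract_reference_queries(
        reference, t_latent=400, t_cond=KEY_TIMESTEP, rng=rng)  # 400 is arbitrary
    image, n_passes = generate(ref_q, num_steps=10, rng=rng, n_tokens=4)
    print("reference passes: 1, denoising passes:", n_passes)
```

The sketch is meant to show the cost structure only: the reference is passed through the network once regardless of the number of sampling steps, whereas a dual-path scheme would run a reference branch at every step.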

Paper Structure

This paper contains 32 sections, 3 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: FreeControl enables efficient, structure-aware generation from raw image references. Top-left: structure-conditioned generation using the reference image on the left. Top-right: tunable control strength via adjustable attention injection. Bottom: compositional generation from user-assembled reference images enables intuitive spatial and semantic layout control.
  • Figure 2: Illustration of the one-step attention extraction framework. The attention query matrices in the later layers (blue layers) are extracted from a forward-simulated latent at a single key timestep and are injected consistently throughout the generation process to enable structural guidance.
  • Figure 3: Left: Noisy artifacts induced by the noise term. Right: Different granularity of structural control under different key timesteps.
  • Figure 4: Qualitative comparisons on structure-conditioned image generation. Rows 1 and 3 show results where all methods are conditioned using the original caption of the reference image. Rows 2 and 4 present generations under stylized prompts to evaluate each method's ability to generalize beyond the original content.
  • Figure 5: Examples of compatibility with fine-tuned or LoRA-augmented models.
  • ...and 13 more figures