ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother

TL;DR

This work reframes guided text-to-image diffusion as a feedback-control problem and shows that existing setups suffer from delayed, low-bandwidth feedback. It introduces ControlNet-XS, a compact controller that enables bidirectional, high-frequency communication with the generator, achieving state-of-the-art results for pixel-accurate guidance (depth, edges, semantic maps) while remaining on a par for looser guidance (human poses) and roughly doubling inference and training speed. The approach substantially reduces the controller's parameter count (e.g., 55M versus 361M for ControlNet) and also performs well on larger models such as Stable Diffusion XL (SDXL), highlighting improved efficiency and democratization potential. The work also investigates semantic bias in large controllers and provides a foundation for integrating the method into broader feedback-control applications across AI-generated content domains.

Abstract

The field of image synthesis has made tremendous strides forward in recent years. Besides defining the desired output image with text prompts, an intuitive approach is to additionally use spatial guidance in the form of an image, such as a depth map. In state-of-the-art approaches, this guidance is realized by a separate controlling model that controls a pre-trained image generation network, such as a latent diffusion model. Viewing this process from a control-system perspective shows that it forms a feedback-control system, where the control module receives a feedback signal from the generation process and sends a corrective signal back. When analysing existing systems, we observe that the feedback signals are temporally sparse and carry only a small number of bits. As a consequence, there can be long delays between newly generated features and the respective corrective signals for these features. It is known that this delay is the most unwanted aspect of any control system. In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be high-frequency and large-bandwidth. By doing so, we are able to considerably improve the quality of the generated images, as well as the fidelity of the control. Also, the controlling network needs noticeably fewer parameters and hence is about twice as fast during inference and training. Another benefit of small-sized models is that they help to democratise our field and are likely easier to understand. We call our proposed network ControlNet-XS. When comparing with the state-of-the-art approaches, we outperform them for pixel-level guidance, such as depth, Canny edges, and semantic segmentation, and are on a par for loose keypoint guidance of human poses. All code and pre-trained models will be made publicly available.
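
The abstract's point about delay can be illustrated with a toy example. The sketch below is not the paper's architecture: it is a scalar proportional controller, with hypothetical names (`simulate`, `tracking_error`) and an arbitrarily chosen gain of 0.6, that only sees the state from `delay` steps ago. It is meant only to show, under these assumptions, how stale feedback degrades tracking.

```python
def simulate(delay, gain=0.6, steps=60, target=1.0):
    """Drive a scalar state toward `target` with a proportional controller.

    The controller observes the state as it was `delay` steps ago
    (illustrative assumption; a delay of 0 means immediate feedback).
    """
    history = [0.0]  # state starts at 0
    for t in range(steps):
        # Stale observation: fall back to the initial state early on.
        observed = history[max(0, t - delay)]
        correction = gain * (target - observed)
        history.append(history[-1] + correction)
    return history


def tracking_error(history, target=1.0):
    """Mean absolute deviation from the target over the whole run."""
    return sum(abs(target - x) for x in history) / len(history)


if __name__ == "__main__":
    for d in (0, 1, 3):
        print(f"delay={d}: mean |error| = {tracking_error(simulate(d)):.4f}")
```

With immediate feedback the state settles smoothly on the target; with a one-step delay it overshoots and oscillates before settling; with a three-step delay and the same gain the loop becomes unstable. This is only an analogy for why more frequent communication between controller and generator is desirable.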

Paper Structure

This paper contains 25 sections, 2 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Image synthesis using our approach with text prompts as well as a guidance image in the form of a depth map, Canny-edge image, semantic map, or human pose. The two results on the left-hand side were generated by the production-quality model Stable Diffusion XL [Podell2023_SDXL], and the remaining ones by Stable Diffusion version 1.5.
  • Figure 2: Feedback-Control System Perspective. In each figure (a-c) the generation process is on the left-hand side, and the control process on the right-hand side. The focus of this illustration is on the communication (directed arrows) between the generation and controlling process. (a) Feedback-control system for the approaches [UniControlNet, UniControl, Zhang2023_ControlNet, Cocktail], where links denoted by * are only present in Cocktail. (b) An example of our communication design. (c) Zoom into the connections between a generative encoder block and a ControlNet-XS block. This figure is explained in the subsection on the control-system perspective.
  • Figure 3: Architectural choices. Different design sketches for controlling a U-Net-based generation process with a controlling network. In each example, the generation process is on the left-hand side and the control process on the right-hand side. (a) The architecture of ControlNet [Zhang2023_ControlNet]. (b-d) Three new architectures (Type A-C) proposed in this work. We verify experimentally that model Type B performs better than Type A and is on a par with Type C. We choose Type B as our final architecture and call it ControlNet-XS, since it has fewer parameters than Type C.
  • Figure 4: The fidelity of the control decreases as the ControlNet-XS model gets smaller. With the $55$M-parameter model, the complex structure of the street junction is identical to the one in the original image, as are the skyscrapers in the upper-left corner. Smaller models with $11.7$M and $1.7$M parameters, respectively, are still guided by the control but less rigorously.
  • Figure 5: Semantic bias for depth control. The inputs are the control depth map of a street scene and an unrelated text prompt: "high quality photo of a delicious cake, 4k image". The large-sized ($361$M) ControlNet [Zhang2023_ControlNet] has a semantic bias and is unable to produce a cake scene with the given depth, independent of the control strength $\alpha$. Our small-sized models with $11.7$M and $55$M parameters, respectively, mitigate this bias.
  • ...and 10 more figures