Readout Guidance: Learning Control from Diffusion Features

Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski

TL;DR

Readout Guidance enables flexible, sampling-time control of frozen diffusion models by training lightweight readout heads that extract target signals from intermediate diffusion features. The method reframes guidance from classifiers to regressors, using a distance-based loss and backpropagated gradients to steer sampling toward user-defined targets for both single-image properties (pose, depth, edges) and cross-image relations (correspondence, appearance similarity). It achieves strong, data-efficient control across drag-based manipulation, identity-consistent generation, and spatially aligned tasks, with markedly fewer training examples and parameters than prior adapters. The approach is modular, compatible with existing conditioning frameworks (e.g., ControlNet, T2I-Adapter), and provides a unified recipe for diverse conditional controls with a simple, shared architecture and sampling procedure.

Abstract

We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io.
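
As a rough illustration of the guidance step described above (compare the readout to a target, back-propagate through the readout head, and nudge the sample), here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the paper's implementation: the "frozen model" is a fixed `tanh` layer, the "readout head" is a fixed linear map, and the names `W`, `V`, `features`, `readout`, `loss_and_grad`, and `guide` are hypothetical. The diffusion sampler itself (noise prediction, timesteps, guidance weighting) is omitted; only the distance-based loss and its gradient with respect to the sample are shown.

```python
import math

# Toy stand-ins (hypothetical, not the paper's networks). Both stay fixed:
# guidance only changes the sample x, never the weights.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]  # frozen "feature extractor"
V = [[1.0, 0.5, -0.5]]                       # lightweight "readout head"


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matvec_t(M, v):
    """Multiply the transpose of M by vector v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def features(x):
    """Intermediate features of the frozen toy model."""
    return [math.tanh(a) for a in matvec(W, x)]


def readout(h):
    """Readout estimate extracted from the features."""
    return matvec(V, h)


def loss_and_grad(x, target):
    """Squared distance between readout and target, and its gradient w.r.t. x."""
    h = features(x)
    y = readout(h)
    diff = [yi - ti for yi, ti in zip(y, target)]
    loss = sum(d * d for d in diff)
    # Manual back-propagation: readout head -> tanh features -> sample.
    g_y = [2.0 * d for d in diff]
    g_h = matvec_t(V, g_y)
    g_pre = [g * (1.0 - hi * hi) for g, hi in zip(g_h, h)]
    g_x = matvec_t(W, g_pre)
    return loss, g_x


def guide(x, target, steps=50, step_size=0.1):
    """Repeatedly move x against the gradient of the readout loss."""
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(x, target)
        losses.append(loss)
        x = [xi - step_size * gi for xi, gi in zip(x, grad)]
    losses.append(loss_and_grad(x, target)[0])
    return x, losses


x_final, losses = guide([0.0, 0.0], target=[0.3])
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.6f}")
```

In an actual sampler, the gradient would presumably be combined with the model's update at each denoising step rather than applied in a standalone loop; the loop here only shows that following the readout gradient pulls the readout toward the target.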

Paper Structure (27 sections, 7 equations, 24 figures, 3 tables)


Figures (24)

  • Figure 1: Given a frozen pre-trained text-to-image diffusion model [Rombach_2022_CVPR], we learn parameter-efficient readout heads to interpret relevant signals, or readouts, from the intermediate network features. These readouts can be single-image concepts such as pose and depth, or relative concepts between two images, such as appearance similarity and correspondence. We use the readouts for sampling-time guidance to enable controlled image generation.
  • Figure 2: Drag-Based Manipulation (Generated Images): We show generated images with a single user correspondence constraint (with overlay), followed by the result generated with Readout Guidance. Please see the Supplemental for the associated text prompts.
  • Figure 3: Readout Head Training: (left) Readout heads convert frozen diffusion features into representations useful for a diverse set of tasks, including predicting (middle) correspondences between a source and target image and (right) an appearance similarity feature between an anchor and positive / negative images.
  • Figure 4: Spatially Aligned Controls (Generated Images): In each example, we show the input user control, provided as a pose, depth map, or edge map derived from a different image (not shown), as well as our generated result, and a visualization of the readout head output for the generated image. We show more results in the Supplemental.
  • Figure 5: Drag-Based Manipulation (Real Images): The appearance similarity and correspondence feature heads can operate on real images when the reference features are seeded with those from DDIM inversion [song2020denoising]. We compare against the concurrent work DragDiffusion [pan2023drag]. Note that DragDiffusion requires an additional user input mask, whereas our method does not.
  • ...and 19 more figures