General and Efficient Steering of Unconditional Diffusion

Qingsong Wang; Mikhail Belkin; Yusu Wang

TL;DR

This work targets controllable generation with unconditional diffusion models without inference-time gradients. It introduces Noise-Aligned RFM Steering (NA-RFM), which computes class-conditional PCA statistics and learns class directions with Recursive Feature Machines offline, then applies two-stage, gradient-free guidance: noise alignment at high noise levels for coarse structure, and RFM-based activation steering at lower noise levels for fine-grained control. Across CIFAR-10, ImageNet, CelebA-HQ, and Birds-525, the method substantially improves accuracy and image quality over gradient-based baselines while delivering major inference speedups and requiring zero classifier evaluations at inference. The results demonstrate a scalable, generalizable approach to post-hoc controllable diffusion that exploits the transferability of activation-space directions across timesteps and samples, reducing computational overhead without sacrificing fidelity.
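
The two-stage, gradient-free schedule described above can be pictured as a per-step dispatch on the noise level. The following is a minimal NumPy sketch under stated assumptions: the thresholds `sigma_align_min` and `sigma_steer_min`, all function names, and especially the `align_noise` rule (pulling the part of the sample that lies in a class's top-k PCA subspace toward the class mean) are hypothetical stand-ins chosen for illustration. The paper's exact alignment and injection rules are not given in this summary.

```python
import numpy as np


def two_stage_plan(sigmas, sigma_align_min, sigma_steer_min):
    """Label each sampling step with the mechanism applied there (hypothetical thresholds).

    sigma >= sigma_align_min                   -> "align" (noise alignment, coarse structure)
    sigma_steer_min <= sigma < sigma_align_min -> "steer" (RFM activation steering)
    sigma < sigma_steer_min                    -> "none"  (plain unconditional step)
    """
    plan = []
    for s in sigmas:
        if s >= sigma_align_min:
            plan.append("align")
        elif s >= sigma_steer_min:
            plan.append("steer")
        else:
            plan.append("none")
    return plan


def align_noise(x, class_mean, pcs, strength):
    """HYPOTHETICAL alignment rule: move the component of x lying in the class's
    top-k PCA subspace (orthonormal columns of pcs, shape (d, k)) toward the class mean."""
    delta = pcs @ (pcs.T @ (class_mean - x))
    return x + strength * delta


def steer_activation(h, direction, alpha):
    """Add a fixed, unit-normalized concept direction to an activation; no gradients needed."""
    return h + alpha * direction / np.linalg.norm(direction)


# Usage: a decreasing noise schedule split into the two guidance stages.
print(two_stage_plan([80.0, 10.0, 2.0, 0.5, 0.01], sigma_align_min=5.0, sigma_steer_min=0.3))
# ['align', 'align', 'steer', 'steer', 'none']
```

Because both mechanisms only add precomputed vectors, each guided step costs one extra vector addition rather than a classifier forward and backward pass, which is where the claimed speedup comes from.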

Abstract

Guiding unconditional diffusion models typically requires either retraining with conditional inputs or per-step gradient computations (e.g., classifier-based guidance), both of which incur substantial computational overhead. We present a general recipe for efficiently steering unconditional diffusion models without gradient guidance during inference, enabling fast controllable generation. Our approach builds on two observations about diffusion model structure. Noise alignment: even in early, highly corrupted stages, coarse semantic steering is possible using a lightweight, offline-computed guidance signal, avoiding any per-step or per-sample gradients. Transferable concept vectors: once learned, a concept direction in activation space transfers across both timesteps and samples; the same fixed steering vector learned at a low noise level remains effective when injected at intermediate noise levels for every generation trajectory, providing refined conditional control efficiently. Such concept directions can be identified efficiently and reliably via the Recursive Feature Machine (RFM), a lightweight, backpropagation-free feature-learning method. Experiments on CIFAR-10, ImageNet, and CelebA demonstrate improved accuracy and image quality over gradient-based guidance, together with significant inference speedups.
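
The RFM direction-discovery step can be sketched compactly. This is a minimal NumPy sketch of a standard Recursive Feature Machine (kernel ridge regression with a Laplace kernel under a learned metric M, alternated with an average-gradient-outer-product update of M), assuming binary labels for a single concept and activations flattened to vectors. The bandwidth, ridge value, iteration count, trace rescaling of M, and the sign convention for the returned direction are our own illustrative choices, not values from the paper.

```python
import numpy as np


def _laplace_kernel(X, Z, M, bandwidth):
    """Laplace kernel exp(-||x - z||_M / L) with ||u||_M = sqrt(u^T M u); also returns distances."""
    XM, ZM = X @ M, Z @ M
    sq = (np.sum(XM * X, axis=1)[:, None]
          + np.sum(ZM * Z, axis=1)[None, :]
          - 2.0 * XM @ Z.T)  # expansion valid because M is symmetric
    dist = np.sqrt(np.clip(sq, 0.0, None))
    return np.exp(-dist / bandwidth), dist


def rfm_direction(X, y, n_iters=3, bandwidth=10.0, reg=1e-2):
    """Recursive Feature Machine: alternate kernel ridge regression with an
    average-gradient-outer-product (AGOP) update of the metric M, then return
    the top eigenvector of M as a unit-norm concept direction.

    X: (n, d) activations; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    M = np.eye(d)
    for _ in range(n_iters):
        K, dist = _laplace_kernel(X, X, M, bandwidth)
        np.fill_diagonal(dist, 0.0)                      # self-distances are exactly zero
        a = np.linalg.solve(K + reg * np.eye(n), y)      # kernel ridge weights
        # grad f(x_j) = -(1/L) M sum_i a_i (K_ji / d_ji) (x_j - x_i); pairs with d_ji = 0 are dropped
        safe = np.where(dist > 0, dist, 1.0)
        W = np.where(dist > 0, K / safe, 0.0)
        s = W @ a                                        # (n,)
        T = (W * a[None, :]) @ X                         # (n, d)
        grads = -((X * s[:, None] - T) @ M) / bandwidth  # (n, d); row j = grad f(x_j)
        M = grads.T @ grads / n                          # AGOP
        M *= d / np.trace(M)                             # rescale only; eigenvectors unchanged
    _, eigvecs = np.linalg.eigh(M)                       # eigenvalues in ascending order
    v = eigvecs[:, -1]
    # Sign convention (ours): point from the negative class toward the positive class.
    if v @ (X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0)) < 0:
        v = -v
    return v


# Toy usage: the concept is encoded along coordinate 0 of 5-D "activations".
rng = np.random.default_rng(0)
y = np.where(rng.random(300) < 0.5, 1.0, -1.0)
X = rng.normal(size=(300, 5))
X[:, 0] += 1.5 * y
v = rfm_direction(X, y)
```

In a steering setup of the kind the abstract describes, a vector like `v`, scaled by some strength, would be added to one block's activations at every step of the steering window and reused unchanged across timesteps and samples.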

Paper Structure (27 sections, 12 equations, 15 figures, 18 tables, 1 algorithm)

Introduction
Background and Related Work
Diffusion Models
Method: Noise-Aligned RFM Steering
Method overview and motivating findings
Guidance for early stage: Noise alignment.
Guidance in late stage: RFM-Based direction discovery and transfer
RFM Training.
Steering Application.
CFG-Style Boosting.
Inference
Experiments
CIFAR-10: Controlled Benchmark
ImageNet: Scaling to Higher Resolution
CelebA: Multi-Attribute Guidance
...and 12 more sections

Figures (15)

Figure 1: Overview of Noise-Aligned RFM Steering. Offline: We compute class-conditional PCA statistics (cyan) and extract RFM steering directions from forward-process activations (orange). Inference: During sampling, we apply noise alignment at high noise levels to establish coarse class structure, then RFM steering at intermediate noise for fine-grained discriminative control. The two mechanisms require no classifier gradients at inference time.
Figure 2: Linear probe accuracy on forward vs. reverse diffusion activations. We probe U-Net activations by training linear classifiers on features collected during the reverse (sampling; solid lines) and forward (noising; dashed lines) processes across five representative blocks of the U-Net architecture.
Figure 3: Temporal transfer of RFM directions. Cosine-similarity heatmaps for three representative blocks show that directions exhibit temporal similarity for later-stage blocks. Both axes index diffusion timesteps, increasing left to right ($x$) and top to bottom ($y$). The encoder block (left) exhibits the highest temporal stability, with the highest cross-timestep cosine similarity, followed by the decoder block (right) and the middle block (middle).
Figure 4: Fine-grained bird species guidance. Correctly classified samples for OOD species: Lucifer Hummingbird (top-left), Scarlet Macaw (top-right), Brown Headed Cowbird (bottom-left), and Fairy Tern (bottom-right).
Figure 5: Training noise level ablation. Generation accuracy remains robust across $\sigma \in [0.6, 2.0]$, with modest degradation at $\sigma=5.0$.
...and 10 more figures
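
The temporal-transfer heatmaps of Figure 3 amount to a pairwise cosine-similarity matrix over directions learned at different timesteps. A minimal sketch, assuming the per-timestep directions for one block are stacked as rows of an array; the `transfer_score` summary (mean off-diagonal similarity) is our own illustrative statistic, not a metric reported in the paper. Since eigenvector-based directions are defined only up to sign, a real comparison may need to sign-align them (or use absolute cosine) first.

```python
import numpy as np


def cross_timestep_cosine(directions):
    """Pairwise cosine similarity between per-timestep steering directions.

    directions: (T, d) array; row t is the direction learned at timestep t for one block.
    Returns a (T, T) matrix whose entry (t, s) is cos(v_t, v_s).
    """
    V = np.asarray(directions, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T


def transfer_score(C):
    """Mean off-diagonal similarity: one scalar summary of temporal stability (our choice)."""
    T = C.shape[0]
    return (C.sum() - np.trace(C)) / (T * (T - 1))


# Usage: three toy "timesteps" whose directions rotate gradually.
C = cross_timestep_cosine([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
score = transfer_score(C)
```

Computing one such matrix per block and comparing their off-diagonal mass is one way to rank encoder, middle, and decoder blocks by how well a single fixed direction would transfer across timesteps.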
