Condition-Aware Neural Network for Controlled Image Generation

Han Cai; Muyang Li; Zhuoyang Zhang; Qinsheng Zhang; Ming-Yu Liu; Song Han

TL;DR

This work addresses controlled image generation by making the weights of a neural network depend on the condition. It introduces Condition-Aware Neural Network (CAN), which generates a conditional weight $W_c$ from a condition embedding and fuses it with the static weight $W$ to steer generation, and applies it to diffusion transformers such as DiT and UViT. The authors provide practical design guidelines and ablations showing that it is critical to make only a subset of layers condition-aware, and they report substantial improvements in FID and CLIP controllability. Combining CAN with EfficientViT yields CaT, which brings major efficiency gains. These findings indicate that weight-space conditioning can outperform traditional conditioning methods and deliver strong results on large-scale image synthesis at far lower computational cost, which makes deployment on edge devices practical. CAN thus offers a flexible, efficient approach to controlled image generation that applies directly to real-world diffusion-based systems.

Abstract

We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512$\times$512, surpassing DiT-XL/2 while requiring 52$\times$ fewer MACs per sampling step.
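To make the weight-generation idea concrete, here is a minimal NumPy sketch of a condition-aware linear layer. It rests on several assumptions that the abstract does not spell out: the conditional weight is fused with the static weight by simple addition, the weight generator is a single linear map, and all names (`generate_conditional_weight`, `condition_aware_linear`, `gen_w`, `gen_b`) are hypothetical. It is an illustration of the concept, not the authors' implementation.

```python
import numpy as np

def generate_conditional_weight(cond, gen_w, gen_b, out_features, in_features):
    """Map condition embeddings (B, D) to per-sample weights (B, out, in).

    Assumption: the generator is one linear map; the real module may differ.
    """
    flat = cond @ gen_w + gen_b                      # (B, out * in)
    return flat.reshape(-1, out_features, in_features)

def condition_aware_linear(x, static_w, cond_w):
    """Apply y_b = (W + W_c[b]) x_b for every sample b in the batch.

    Assumption: fusion is additive.
    """
    fused = static_w[None, :, :] + cond_w            # (B, out, in)
    return np.einsum("boi,bi->bo", fused, x)

rng = np.random.default_rng(0)
B, D, IN, OUT = 4, 8, 6, 5                           # illustrative sizes
x = rng.normal(size=(B, IN))                         # layer input
cond = rng.normal(size=(B, D))                       # condition embedding
static_w = rng.normal(size=(OUT, IN))                # shared static weight W
gen_w = rng.normal(size=(D, OUT * IN)) * 0.1         # hypothetical generator
gen_b = np.zeros(OUT * IN)

cond_w = generate_conditional_weight(cond, gen_w, gen_b, OUT, IN)
y = condition_aware_linear(x, static_w, cond_w)

# Under additive fusion, the layer equals a static branch plus a
# conditional branch applied to the same input.
y_equiv = x @ static_w.T + np.einsum("boi,bi->bo", cond_w, x)
```

Under the additive assumption, `y` and `y_equiv` agree, which is one way to read a condition-aware layer as a static layer plus a per-sample conditional branch.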

Paper Structure (26 sections, 9 figures, 6 tables)

Introduction
Method
Condition-Aware Neural Network
Practical Design
Which Modules to be Condition-Aware?
CAN vs. Adaptive Kernel Selection.
Implementation.
Experiments
Setups
Datasets.
Evaluation Metric.
Implementation Details.
Ablation Study
Effectiveness of CAN.
Analysis.
...and 11 more sections

Figures (9)

Figure 1: Comparing CAN Models and Prior Image Generative Models on ImageNet 512$\times$512. With the new conditional control method, we significantly improve the performance of controlled image generative models. Combining CAN and EfficientViT [cai2023efficientvit], our CaT model provides a 52$\times$ MACs reduction per sampling step compared with DiT-XL/2 [peebles2023scalable] without performance loss.
Figure 2: Illustration of Condition-Aware Neural Network. Left: A regular neural network with static convolution/linear layers. Right: A condition-aware neural network and its equivalent form.
Figure 3: Overview of Applying CAN to Diffusion Transformer. The patch embedding layer, the output projection layers in self-attention, and the depthwise convolution (DW Conv) layers are condition-aware. The other layers are static. All output projection layers share the same conditional weight while still having their own static weights.
Figure 4: CAN is More Effective than Adaptive Kernel Selection.
Figure 5: Practical Implementation of CAN. Left: The condition-aware layers have different weights for different samples. A naive implementation requires running the kernel call independently for each sample, which incurs a large overhead for training and batch inference. Right: An efficient implementation for CAN. We fuse all kernel calls into a grouped convolution. We insert a batch-to-channel transformation before the kernel call and add a channel-to-batch conversion after the kernel call to preserve the functionality.
...and 4 more figures
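
The batch-to-channel trick described in the Figure 5 caption can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' kernel: it uses a 1$\times$1 convolution for brevity, and `grouped_conv1x1`, `naive_per_sample`, and `fused_grouped` are hypothetical helper names. The sketch checks that folding the batch into the channel dimension and running one grouped convolution with one group per sample gives the same result as one kernel call per sample.

```python
import numpy as np

def grouped_conv1x1(x, weight, groups):
    """1x1 grouped convolution.

    x:      (N, groups * C_in, H, W)
    weight: (groups * C_out, C_in)
    """
    n, c_in_total, h, w = x.shape
    c_in = c_in_total // groups
    c_out = weight.shape[0] // groups
    xg = x.reshape(n, groups, c_in, h, w)
    wg = weight.reshape(groups, c_out, c_in)
    out = np.einsum("ngihw,goi->ngohw", xg, wg)
    return out.reshape(n, groups * c_out, h, w)

def naive_per_sample(x, weights):
    """One kernel call per sample, each with its own weight."""
    return np.stack([
        grouped_conv1x1(x[b:b + 1], weights[b], groups=1)[0]
        for b in range(x.shape[0])
    ])

def fused_grouped(x, weights):
    """Single grouped call with one group per sample."""
    b, c_in, h, w = x.shape
    c_out = weights.shape[1]
    x_bc = x.reshape(1, b * c_in, h, w)        # batch-to-channel
    w_bc = weights.reshape(b * c_out, c_in)    # stack per-sample weights
    y = grouped_conv1x1(x_bc, w_bc, groups=b)  # one fused kernel call
    return y.reshape(b, c_out, h, w)           # channel-to-batch

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 2, 5, 5))              # (B, C_in, H, W)
weights = rng.normal(size=(3, 4, 2))           # per-sample (B, C_out, C_in)

y_naive = naive_per_sample(x, weights)
y_fused = fused_grouped(x, weights)
```

In a deep learning framework the same reshapes would wrap a native grouped convolution, so the per-sample loop disappears while the output stays unchanged.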
