Universal Few-Shot Spatial Control for Diffusion Models

Kiet T. Nguyen, Chanhyuk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong

TL;DR

This work introduces Universal Few-Shot Control (UFC), a unified, data-efficient framework for steering frozen diffusion models with unseen spatial conditions. UFC uses patch-wise matching over a small support set to interpolate task-specific control features, and combines it with episodic meta-training and parameter-efficient fine-tuning to generalize across diverse spatial modalities. In experiments across six spatial tasks and two backbones (UNet and DiT), UFC delivers strong few-shot controllability with as few as 30 annotated examples and remains competitive with fully supervised baselines when fine-tuned with 0.1% of the full training data, while maintaining solid image quality. The approach offers practical, versatile spatial control for diffusion models under limited labeled data and across architectures. Limitations include a focus on spatial control rather than appearance-preserving tasks and the need for some fine-tuning on each new control, which suggests future work on in-context-like adaptation and broader task applicability.

Abstract

Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.
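
To make the matching mechanism mentioned above more concrete, here is a minimal sketch of patch-wise matching in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes the matching is an attention-style softmax over scaled dot-product similarities between query-condition patches and support-condition patches, with the resulting weights used to interpolate support-image patch features. The function names (`softmax`, `match_patches`), the `temperature` parameter, and the toy embeddings are all hypothetical; the encoders that would produce the embeddings are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_patches(query_cond, support_cond, support_img, temperature=1.0):
    """Interpolate one control feature per query patch (hypothetical sketch).

    query_cond:   patch embeddings of the query condition (list of vectors)
    support_cond: patch embeddings of the support conditions, pooled over
                  all support pairs (list of vectors)
    support_img:  patch embeddings of the support images, aligned
                  one-to-one with support_cond (list of vectors)
    """
    assert len(support_cond) == len(support_img), "support patches must align"
    out_dim = len(support_img[0])
    control = []
    for q in query_cond:
        scale = temperature * math.sqrt(len(q))
        # Similarity of this query patch to every support-condition patch.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in support_cond]
        weights = softmax(scores)
        # Weighted average (convex combination) of support-image features.
        feat = [sum(w * v[d] for w, v in zip(weights, support_img))
                for d in range(out_dim)]
        control.append(feat)
    return control


# Toy usage: two support patches, two query patches.
support_cond = [[10.0, 0.0], [0.0, 10.0]]
support_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query_cond = [[10.0, 0.0], [0.0, 0.0]]
control = match_patches(query_cond, support_cond, support_img)
# The first query patch matches the first support patch almost exclusively;
# the second (all-zero) query weights both support patches equally.
print(control)
```

In this sketch each control feature is a convex combination of support-image features, so a query patch whose condition closely resembles a support condition patch inherits that patch's image feature. The abstract's second component, the update on a small set of task-specific parameters, is not shown here.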

Paper Structure

This paper contains 53 sections, 12 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Results of our method learned with 30 examples on unseen spatial conditions. The proposed control adapter guides the pre-trained T2I models in a versatile and data-efficient manner.
  • Figure 2: Overview of the proposed framework. The control adapter $\mathcal{I}$ consists of an image encoder $f$, a condition encoder $g_\tau$, and a matching module implementing Eq. \ref{eq:matching}. The support image-condition pairs and the query conditions are encoded to extract multi-layer features. The matching module at each layer is applied to produce control features. The features are then injected into the generation process following the mechanism in Section \ref{sec:architecture} to control the structure of images.
  • Figure 3: Qualitative comparison across six spatial control tasks. Our method (highlighted in black boxes), fine-tuned with 30-shot on unseen tasks, demonstrates competitive controllability with fully supervised baselines. In contrast, other baselines struggle to follow the spatial guidance accurately.
  • Figure 4: Performance of UFC when fine-tuned with different numbers of support data. Overall, UFC consistently improves the controllability with the increasing size of support sets. The results on FID are presented in Appendix \ref{supp:shots}, Figure \ref{supp:fig:fid}.
  • Figure 5: Qualitative comparison between UFC (30-shot), FreeControl \cite{mo2024freecontrol}, and Ctrl-X \cite{ctrl_x} on four control tasks.
  • ...and 10 more figures