Underlying Semantic Diffusion for Effective and Efficient In-Context Learning
Zhong Ji, Weilong Cao, Yan Zhang, Yanwei Pang, Jungong Han, Xuelong Li
TL;DR
This work tackles the difficulty diffusion models face in preserving underlying semantic structures and leveraging in-context learning across diverse tasks, while also addressing computational efficiency. It introduces Underlying Semantic Diffusion (US-Diffusion), a multi-component framework that integrates Separate & Gather Adapter (SGA), Feedback-Aided Learning (FAL), and Efficient Sampling Strategy (ESS) with a Stable Diffusion backbone and ControlNet to support Map2Image and Image2Map tasks. SGA decouples input conditions by task to enhance in-context learning, FAL provides image-space feedback to guide semantic content capture, and ESS non-uniformly concentrates training and inference on high-noise time steps to accelerate processing. Empirical results demonstrate substantial improvements in FID and RMSE across multiple datasets, along with about a 9.45x speedup in inference, indicating strong generalization to new tasks and datasets and offering a practical, scalable solution for real-time multi-task diffusion-based vision tasks.
Abstract
Diffusion models have emerged as a powerful framework for tasks such as controllable image generation and dense prediction. However, existing models often struggle to capture underlying semantics (e.g., edges, textures, shapes) and to utilize in-context learning effectively, which limits their contextual understanding and image generation quality. Additionally, high computational costs and slow inference speeds hinder their real-time applicability. To address these challenges, we propose Underlying Semantic Diffusion (US-Diffusion), an enhanced diffusion model that improves underlying semantics learning, computational efficiency, and in-context learning capabilities in multi-task scenarios. We introduce the Separate & Gather Adapter (SGA), which decouples input conditions for different tasks while sharing the architecture, enabling better in-context learning and generalization across diverse visual domains. We also present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details and dynamically adapting to task-specific contextual cues. Furthermore, we propose a plug-and-play Efficient Sampling Strategy (ESS) that samples densely at time steps with high noise levels, with the aim of improving training and inference efficiency while maintaining strong in-context learning performance. Experimental results demonstrate that US-Diffusion outperforms the state-of-the-art method, achieving an average reduction of 7.47 in FID on Map2Image tasks and an average reduction of 0.026 in RMSE on Image2Map tasks, while running inference approximately 9.45 times faster. Our method also demonstrates superior training efficiency and in-context learning capabilities, performing well on new datasets and tasks, which highlights its robustness and adaptability across diverse visual domains.
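To make the idea of dense sampling at high-noise time steps concrete, the following is a minimal sketch and not the ESS procedure itself, whose details are not given in this abstract. It assumes the common convention that larger time-step indices correspond to higher noise, and it uses a power-law weighting chosen purely for illustration. The function names `sample_training_timesteps` and `high_noise_schedule` and the parameters `gamma` and `power` are hypothetical.

```python
import random


def sample_training_timesteps(num_steps, batch_size, gamma=2.0, rng=None):
    """Draw training time steps biased toward high noise (large t).

    Assumption: the weight of step t is ((t + 1) / num_steps) ** gamma,
    an illustrative power law, not the distribution used by ESS.
    """
    rng = rng or random.Random()
    weights = [((t + 1) / num_steps) ** gamma for t in range(num_steps)]
    return rng.choices(range(num_steps), weights=weights, k=batch_size)


def high_noise_schedule(num_steps, num_inference_steps, power=2.0):
    """Return a descending inference schedule that is denser near t = num_steps - 1.

    Assumption: steps are placed at (num_steps - 1) * (1 - u ** power) for
    evenly spaced u in [0, 1]; with power > 1 the gaps are small at high
    noise and grow as t approaches 0.
    """
    if num_inference_steps < 2:
        raise ValueError("num_inference_steps must be at least 2")
    last = num_steps - 1
    steps = set()
    for i in range(num_inference_steps):
        u = i / (num_inference_steps - 1)
        steps.add(round(last * (1.0 - u ** power)))
    # Rounding can produce duplicates for long schedules; the set removes them.
    return sorted(steps, reverse=True)


if __name__ == "__main__":
    print(high_noise_schedule(1000, 10))
    print(sample_training_timesteps(1000, 8, rng=random.Random(0)))
```

With `power` or `gamma` set to 1 the schedule and the sampler reduce to the uniform case in spacing and to a linear weighting respectively, so the exponent acts as a single knob for how strongly computation is concentrated on the high-noise region.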
