Underlying Semantic Diffusion for Effective and Efficient In-Context Learning

Zhong Ji, Weilong Cao, Yan Zhang, Yanwei Pang, Jungong Han, Xuelong Li

TL;DR

This work tackles the difficulty diffusion models face in preserving underlying semantic structures and leveraging in-context learning across diverse tasks, while also addressing computational efficiency. It introduces Underlying Semantic Diffusion (US-Diffusion), a multi-component framework that integrates Separate & Gather Adapter (SGA), Feedback-Aided Learning (FAL), and Efficient Sampling Strategy (ESS) with a Stable Diffusion backbone and ControlNet to support Map2Image and Image2Map tasks. SGA decouples input conditions by task to enhance in-context learning, FAL provides image-space feedback to guide semantic content capture, and ESS non-uniformly concentrates training and inference on high-noise time steps to accelerate processing. Empirical results demonstrate substantial improvements in FID and RMSE across multiple datasets, along with about a 9.45x speedup in inference, indicating strong generalization to new tasks and datasets and offering a practical, scalable solution for real-time multi-task diffusion-based vision tasks.

Abstract

Diffusion models have emerged as a powerful framework for tasks such as controllable image generation and dense prediction. However, existing models often struggle to capture underlying semantics (e.g., edges, textures, shapes) and to use in-context learning effectively, which limits their contextual understanding and image generation quality. Additionally, high computational costs and slow inference speeds hinder their real-time applicability. To address these challenges, we propose Underlying Semantic Diffusion (US-Diffusion), an enhanced diffusion model that improves underlying semantics learning, computational efficiency, and in-context learning capabilities in multi-task scenarios. We introduce the Separate & Gather Adapter (SGA), which decouples input conditions for different tasks while sharing the architecture, enabling better in-context learning and generalization across diverse visual domains. We also present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details and dynamically adapting to task-specific contextual cues. Furthermore, we propose a plug-and-play Efficient Sampling Strategy (ESS) that samples densely at time steps with high noise levels, aiming to optimize training and inference efficiency while maintaining strong in-context learning performance. Experimental results demonstrate that US-Diffusion outperforms the state-of-the-art method, achieving an average reduction of 7.47 in FID on Map2Image tasks and an average reduction of 0.026 in RMSE on Image2Map tasks, with approximately 9.45 times faster inference. Our method also demonstrates superior training efficiency and in-context learning capabilities, performing well on new datasets and tasks, which highlights its robustness and adaptability across diverse visual domains.
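
The idea behind ESS, picking a small number of time steps that cluster where the noise level is high, can be illustrated with a short sketch. The abstract does not give the actual schedule, so everything below is an assumption made for illustration: the function name `nonuniform_timesteps`, the power-law warp, the `power` parameter, and the convention that larger `t` means more noise (as in the usual DDPM indexing).

```python
def nonuniform_timesteps(num_train_steps=1000, num_samples=10, power=2.0):
    """Pick discrete time steps from [0, num_train_steps - 1] that are
    spaced more densely near the high-noise end (large t).

    Hypothetical illustration only: the power-law warp is an assumed
    stand-in, not the schedule defined in the paper.
    """
    steps = []
    for i in range(num_samples):
        u = i / (num_samples - 1)            # uniform grid on [0, 1]
        # 1 - (1 - u)**power is concave for power > 1, so consecutive
        # points move closer together as u approaches 1.
        warped = 1.0 - (1.0 - u) ** power
        steps.append(round(warped * (num_train_steps - 1)))
    # Drop any duplicates that rounding might create and keep ascending order.
    return sorted(set(steps))


if __name__ == "__main__":
    print(nonuniform_timesteps())
```

With the default arguments this selects 10 of the 1000 discrete steps, and the gap between neighbouring steps shrinks toward the high-noise end. Setting `power=1.0` recovers uniform spacing, which makes it easy to compare the two regimes.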

Paper Structure

This paper contains 28 sections, 15 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of PromptDiff and our US-Diffusion model. The upper part shows how PromptDiff generates a segmentation map from a query image given an example pair. It requires 100 denoising steps, and the resulting map is inconsistent with the input query image in structure and object shape. The lower part illustrates our US-Diffusion, which generates a segmentation map in just 10 denoising steps and produces a map that is more consistent in structure and object shape.
  • Figure 2: The framework of our US-Diffusion model. (a) illustrates the conditional input, including an example pair (example source image and example target image), a query image of the same type as the example source image, and a text prompt. (b) depicts the overall architecture of the model. (c) presents the structure of the proposed Separate & Gather Adapter (SGA). (d) shows the Feedback-Aided Learning (FAL) framework, which divides the process into two paths to provide task-specific feedback. Notably, a novel Efficient Sampling Strategy (ESS) is proposed during the noise-adding and denoising stages.
  • Figure 3: The curve of $\alpha^{2}(t)$ ($t = i/1000$), where 10 of the 1000 discrete time steps are non-uniformly sampled.
  • Figure 4: The impact of different $\lambda$ values. (a) The impact of different $\lambda$ values on Image2Map tasks. (b) The impact of different $\lambda$ values on Image2Map tasks.
  • Figure 5: Comparison of PromptDiff and our US-Diffusion model for Map2Image tasks. The controllable generation results of our US-Diffusion model align more accurately with the spatial details specified by the input query condition-map images, particularly in terms of object placement and structure.
  • ...and 1 more figure