DreamFuse: Adaptive Image Fusion with Diffusion Transformer

Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li

TL;DR

DreamFuse tackles adaptive, interactive image fusion by combining an iterative human-in-the-loop data-generation pipeline with a Diffusion Transformer-based fusion model. It introduces a positional affine mechanism and shared attention to tightly couple foreground and background information, and leverages Localized Direct Preference Optimization to align outputs with human preferences. The approach supports text-driven attribute editing of fused scenes and demonstrates strong performance against state-of-the-art methods across multiple benchmarks, including real-world data. The work advances realistic, controllable fusion in practical applications and highlights avenues for further improving foreground-background consistency.

Abstract

Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Existing methods directly insert objects into the background, whereas adaptive and interactive fusion remains a challenging yet appealing task: it requires the foreground to adjust to or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
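
As a rough illustration of the preference-optimization step, the sketch below shows one way a DPO-style loss could be restricted to an image region. The abstract does not give the formulation, so everything here is an assumption: the loss follows the general diffusion-DPO pattern of comparing the denoising errors of the trained model against those of a frozen reference model on a preferred and a rejected sample, and the "localized" part is modeled by a hypothetical binary mask. The names `masked_sq_error`, `localized_dpo_loss`, and `beta` are illustrative and do not come from the paper.

```python
import numpy as np

def masked_sq_error(pred, target, mask):
    """Mean squared error restricted to the region where mask == 1 (assumed localization)."""
    mask = mask.astype(np.float64)
    return float(((pred - target) ** 2 * mask).sum() / max(mask.sum(), 1.0))

def localized_dpo_loss(err_w_model, err_w_ref, err_l_model, err_l_ref, beta=1.0):
    """DPO-style loss on (masked) denoising errors.

    err_w_*: error on the preferred ("winning") sample, model vs. reference.
    err_l_*: error on the rejected ("losing") sample, model vs. reference.
    The loss falls when the model improves on the preferred sample more than
    it improves on the rejected one, relative to the reference.
    """
    margin = (err_w_model - err_w_ref) - (err_l_model - err_l_ref)
    # -log(sigmoid(-beta * margin)) written in a numerically stable form
    return float(np.logaddexp(0.0, beta * margin))
```

In this sketch the loss equals log 2 when the trained model and the reference behave identically, and it drops below that value once the model's masked error on the preferred sample falls relative to the reference.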

Paper Structure

This paper contains 25 sections, 9 equations, 20 figures, 6 tables, 1 algorithm.

Figures (20)

  • Figure 1: DreamFuse adapts to diverse scenarios, including style transfer, wearable items, logo printing, placement, and handheld objects. Notably, when given a text prompt, our method responds by further editing the attributes of the foreground object (e.g., a golden car).
  • Figure 2: The framework of the data generation model and position matching process. The left side of the image illustrates the design structure of our data generation model, while the right side shows the position matching process and data format. We enhance the diversity of fused data generation through flexible and rich prompts combined with various style LoRAs.
  • Figure 3: The DreamFuse framework. We apply positional affine transformations to map the foreground's position and size onto the background. The foreground and background are concatenated with the noisy fused image as condition images before DiT's attention layers. Localized direct preference optimization is then used to improve background consistency and foreground harmony.
  • Figure 4: Three ways of injecting positional conditions: (a) using positional affine to map the foreground's position index to its target placement; (b) directly transforming the foreground object to the target position; (c) encoding position mask information with a tokenizer and integrating it into DiT's attention computation.
  • Figure 5: Scene distribution of the fusion dataset, including scenario counts, indoor/outdoor background proportions, and complexity levels.
  • ...and 15 more figures
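
The positional affine idea described in the captions for Figures 3 and 4(a) can be sketched as a scale-and-shift of the foreground's token position indices into a target box on the background grid. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `affine_position_indices`, the `(y0, x0, box_h, box_w)` box layout, and the plain scale-and-shift formula are all hypothetical, and the captions do not say how the resulting indices are consumed by DiT's positional encoding.

```python
import numpy as np

def affine_position_indices(fg_h, fg_w, box):
    """Map a foreground token grid onto a target box in background coordinates.

    fg_h, fg_w: size of the foreground token grid.
    box: (y0, x0, box_h, box_w), the assumed target placement on the background grid.
    Returns an array of shape (fg_h, fg_w, 2) holding (row, col) positions,
    which may be fractional when the box and the grid differ in size.
    """
    y0, x0, box_h, box_w = box
    rows = y0 + np.arange(fg_h) * (box_h / fg_h)  # scale, then shift rows
    cols = x0 + np.arange(fg_w) * (box_w / fg_w)  # scale, then shift columns
    yy, xx = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([yy, xx], axis=-1)

# A 4x4 foreground grid placed into an 8x8 box whose top-left corner is (10, 20).
idx = affine_position_indices(4, 4, (10, 20, 8, 8))
print(idx[0, 0], idx[3, 3])
```

Under these assumptions the foreground pixels stay untouched and only their position indices change, which is the difference between option (a) and option (b) in Figure 4, where the foreground object itself is transformed to the target position.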