Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

TL;DR

This work tackles the fragility and task-specificity of existing image fusion methods by introducing DiTFuse, a Diffusion-Transformer that jointly encodes two images and natural-language instructions in a shared latent space. It achieves end-to-end, semantics-aware fusion across infrared-visible, multi-focus, and multi-exposure scenarios, and extends to text-controlled refinement and segmentation via a multi-task training regime called M3. Core contributions include the DiT-based architecture with LoRA adaptation, a flow-matching training objective, M3 data generation, and a unified instruction-driven learning framework that supports zero-shot generalization. The approach demonstrates state-of-the-art performance across several benchmarks, along with robust text-guided control and a scalable data construction pipeline for fusion tasks without ground-truth fused images.

Abstract

Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground-truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.

Paper Structure

This paper contains 31 sections, 2 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: (a) corresponds to pre-fusion. (b) corresponds to post-fusion. (c) represents our DiT-based method. The images in the first row display the fusion results of two different architectures under over-exposure conditions.
  • Figure 2: Comparison of fusion method capabilities. Task-specific methods (e.g., MFFGAN [zhang2021mff], CRMEF [liu2023embracing], SeAFusion [seafusion]) handle only one fusion type. All-in-one methods (e.g., U2Fusion [u2fusion]) unify tasks but lack interactivity. Our method, DiTFuse, not only supports unified omni-fusion across MFF, MEF, and IVIF, but also enables text-guided fusion and instruction-following segmentation.
  • Figure 3: The framework of our model. Textual control information is encoded through the Text Tokenizer, and image information is encoded into Visual Embeddings via VAE. These are used together as conditional information to control the denoising process of DiT. The left half of the diagram represents the training stage, while the right half represents the inference stage. During the training stage, we primarily use the M3 method to constrain the model, and during the inference stage, it can work on multiple fusion tasks.
  • Figure 4: Training data construction pipeline. The blue box illustrates the input image generation process, which follows the Multi-degradation Mask image Modeling (M3) strategy. The orange box presents the Ground Truths corresponding to different instructions. For M3 training data, the Ground Truth is the source image itself. For control-type instructions, the Ground Truth is created by adjusting the image’s overall contrast or brightness. For segmentation instructions, the Ground Truth is generated by overlaying a blue transparent mask on the target class.
  • Figure 5: The diagram is a sunburst chart showing the data composition. The inner ring displays the four main data categories, and the outer ring shows the proportion of each specific dataset within those categories.
  • ...and 15 more figures
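
The Figure 4 caption describes three kinds of training targets: the source image itself for M3 data, a brightness- or contrast-adjusted image for control-type instructions, and a blue transparent overlay on the target class for segmentation instructions. The sketch below illustrates one way such targets could be built with NumPy. It is not the authors' pipeline: the function names, the instruction labels, the adjustment values, and the blending weight `alpha` are all hypothetical choices made for illustration, and images are assumed to be float RGB arrays in [0, 1].

```python
import numpy as np


def adjust_brightness(img, delta):
    """Shift every pixel by `delta` and clip to [0, 1]."""
    return np.clip(img + delta, 0.0, 1.0)


def adjust_contrast(img, factor):
    """Scale each pixel's deviation from the global mean by `factor`."""
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)


def overlay_blue_mask(img, mask, alpha=0.5):
    """Alpha-blend pure blue onto the pixels where `mask` is True.

    img:  (H, W, 3) float array in [0, 1]
    mask: (H, W) boolean array marking the target class
    """
    blue = np.array([0.0, 0.0, 1.0], dtype=img.dtype)
    out = img.copy()
    out[mask] = (1.0 - alpha) * img[mask] + alpha * blue
    return out


def make_target(source, instruction_type, mask=None):
    """Build a ground-truth image for one (hypothetical) instruction type."""
    if instruction_type == "m3":
        # M3 data: the clean source image is its own target.
        return source.copy()
    if instruction_type == "brighter":
        return adjust_brightness(source, 0.2)   # assumed offset
    if instruction_type == "higher_contrast":
        return adjust_contrast(source, 1.5)     # assumed factor
    if instruction_type == "segment":
        if mask is None:
            raise ValueError("segmentation targets need a class mask")
        return overlay_blue_mask(source, mask)
    raise ValueError(f"unknown instruction type: {instruction_type}")
```

In this sketch the segmentation target stays an ordinary RGB image, which matches the caption's description of a blue transparent mask drawn over the target class, so the same image-generation output format can serve every instruction type.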