Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie
TL;DR
This work tackles the fragility and task-specificity of existing image fusion methods by introducing DiTFuse, a Diffusion-Transformer that jointly encodes two images and natural-language instructions in a shared latent space. It achieves end-to-end, semantics-aware fusion across infrared-visible, multi-focus, and multi-exposure scenarios, and extends to text-controlled refinement and segmentation via a multi-task training regime called M3. Core contributions include the DiT-based architecture with LoRA adaptation, a flow-matching training objective, M3 data generation, and a unified instruction-driven learning framework that supports zero-shot generalization. The approach demonstrates state-of-the-art performance across several benchmarks, along with robust text-guided control and a scalable data construction pipeline for fusion tasks without ground-truth fused images.
Abstract
Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and cannot flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground-truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
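The idea of training without ground-truth fused images can be illustrated with a toy construction. The sketch below is an assumption made for illustration, not the authors' M3 pipeline: it starts from one clean image, degrades complementary regions to produce two source views, and uses the clean image as the fusion target. The function name `make_training_triplet`, the tile-based mask, and the two degradations (darkening and additive noise) are all hypothetical choices.

```python
import random


def make_training_triplet(image, block=2, seed=0):
    """Build (view_a, view_b, target) from one clean grayscale image.

    image: list of rows, each a list of floats in [0, 1].
    Hypothetical setup: each block x block tile is degraded in exactly
    one of the two views, so the views are complementary.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    view_a = [row[:] for row in image]
    view_b = [row[:] for row in image]
    tile_degrades_a = {}  # one random decision per tile

    for y in range(height):
        for x in range(width):
            key = (y // block, x // block)
            if key not in tile_degrades_a:
                tile_degrades_a[key] = rng.random() < 0.5
            if tile_degrades_a[key]:
                # Assumed degradation 1: darken this tile in view A only.
                view_a[y][x] = image[y][x] * 0.2
            else:
                # Assumed degradation 2: add clipped Gaussian noise in view B only.
                noisy = image[y][x] + rng.gauss(0.0, 0.1)
                view_b[y][x] = min(1.0, max(0.0, noisy))

    # The clean image serves as the supervision target.
    target = [row[:] for row in image]
    return view_a, view_b, target
```

Under these assumptions every pixel of the target survives intact in at least one view, so a model trained to map the two views to the target must learn which source to trust in each region, which loosely mirrors the task-aware feature selection described above.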
