Conti-Fuse: A Novel Continuous Decomposition-based Fusion Framework for Infrared and Visible Images

Hui Li, Haolong Ma, Chunyang Cheng, Zhongwei Shen, Xiaoning Song, Xiao-Jun Wu

TL;DR

Conti-Fuse tackles infrared–visible image fusion (IVIF) by moving beyond coarse base/detail or low-/high-frequency decompositions to a continuous decomposition along a feature-variation trajectory. It introduces the Continuous Decomposition Module (CDM) to generate multiple transition states and the State Transformer (ST) to capture cross-state complementarity, guided by a novel decomposition loss with a scalable Support Decomposition Strategy (SDS). The approach outperforms competing methods on multiple datasets (MSRS, M3FD, TNO) across diverse metrics and improves downstream multi-modality segmentation, indicating robust information preservation and texture/detail retention. The results suggest that dense, trajectory-based feature decomposition, paired with efficient sampling and attention-driven interaction, can substantially advance IVIF quality and its applicability to high-level vision tasks.

Abstract

Decomposition plays a crucial role in exploring inter-modal and intra-modal relations, even in deep learning-based fusion frameworks. However, previous decomposition strategies (base & detail or low-frequency & high-frequency) are too coarse to represent the common and unique features of the source modalities, which degrades the quality of the fused images. These strategies treat the relations as a binary system, which may not suit a complex generation task such as image fusion. To address this issue, a continuous decomposition-based fusion framework (Conti-Fuse) is proposed. Conti-Fuse treats the decomposition results as a few samples along the feature variation trajectory of the source images and extends this concept to a more general state to achieve continuous decomposition. This continuous decomposition strategy enhances the representation of inter-modal complementary information by increasing the number of decomposition samples, thus reducing the loss of critical information. To facilitate this process, the continuous decomposition module (CDM) is introduced to decompose the input into a series of continuous components. The core module of CDM, the State Transformer (ST), is used to efficiently capture complementary information from the source modalities. Furthermore, a novel decomposition loss function is designed, which ensures the smooth progression of the decomposition process while keeping time complexity linear in the number of decomposition samples. Extensive experiments demonstrate that the proposed Conti-Fuse achieves superior performance compared to state-of-the-art fusion methods.
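
The linear-growth claim can be illustrated with a small sketch. It is not the paper's decomposition loss or its SDS; it only assumes, hypothetically, that the loss is built from pairwise constraints between $K$ transition states, that constraints between adjacent states are always computed, and that a fixed number of other constraints per state are randomly sampled. The function names and the `n_random_per_state` parameter are illustrative assumptions.

```python
import random


def sample_constraint_pairs(k, n_random_per_state=1, seed=None):
    """Pick index pairs (i, j), i < j, among k transition states.

    Hypothetical scheme (an assumption, not the paper's SDS):
    adjacent pairs (i, i + 1) are always included, and each state
    additionally draws a fixed number of random non-adjacent partners.
    """
    rng = random.Random(seed)
    # Always-computed constraints: one per adjacent pair, k - 1 in total.
    pairs = {(i, i + 1) for i in range(k - 1)}
    for i in range(k):
        # Non-adjacent partners available to state i.
        candidates = [j for j in range(k) if abs(i - j) > 1]
        n_draw = min(n_random_per_state, len(candidates))
        for j in rng.sample(candidates, n_draw):
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def full_pair_count(k):
    """Number of constraints if every pair of states were compared."""
    return k * (k - 1) // 2


if __name__ == "__main__":
    for k in (4, 16, 64):
        sampled = sample_constraint_pairs(k, seed=0)
        print(k, len(sampled), full_pair_count(k))
```

Under these assumptions the number of evaluated constraints is at most $(K-1) + K \cdot$ `n_random_per_state`, which grows linearly in $K$, whereas comparing every pair of states grows quadratically.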

Paper Structure (30 sections, 13 equations, 12 figures, 6 tables)

This paper contains 30 sections, 13 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: A schematic of the unified deep feature space, comparing common SFID methods (e.g., DeFusion [liang2022fusion]) with our proposed continuous decomposition method.
  • Figure 2: The architecture of Conti-Fuse. (a) The pipeline of the proposed method. (b, c, d) The internal structures of the Encoder Block, CDM, and Decoder Block in the $l$-th layer, respectively. The input of the Encoder Block ($X^{(l)}$) can be either a visible or an infrared feature. 'Channel-wise concatenation' and 'State-wise concatenation' refer to concatenation along the channel and state dimensions of the tensors, respectively; 'Linear Transformation' refers to a 1 × 1 convolution, and 'Group Conv' refers to grouped convolution.
  • Figure 3: Illustration of Transition State Wise MHSA (TSWM).
  • Figure 4: An example of $M^{(l)}_c$ in the $l$-th layer when $K=4$. The color depth represents the distance constraint, with darker colors indicating values closer to 1.
  • Figure 5: An example of SDS in the $l$-th layer when $K=4$. The yellow boxes represent randomly sampled constraints, while the red boxes represent those calculated consistently each time.
  • ...and 7 more figures