CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

Hui Li, Xiao-Jun Wu

TL;DR

A novel cross attention mechanism (CAM) is proposed to enhance the complementary information in multi-sensor visual information fusion; it obtains SOTA fusion performance compared with existing fusion networks.

Abstract

Multimodal visual information fusion aims to integrate multi-sensor data into a single image that contains more complementary information and fewer redundant features. However, complementary information is hard to extract, especially for infrared and visible images, where there is a large similarity gap between the two modalities. Common cross attention modules consider only correlation, whereas image fusion tasks need to focus on complementarity (uncorrelation). Hence, in this paper, a novel cross attention mechanism (CAM) is proposed to enhance the complementary information. Furthermore, a fusion scheme based on a two-stage training strategy is presented to generate the fused images. In the first stage, two auto-encoder networks with the same architecture are trained, one for each modality. Then, with the encoders fixed, the CAM and a decoder are trained in the second stage. With the trained CAM, features extracted from the two modalities are integrated into one fused feature in which the complementary information is enhanced and the redundant features are reduced. Finally, the fused image is generated by the trained decoder. The experimental results show that the proposed fusion method obtains SOTA fusion performance compared with existing fusion networks. The code is available at https://github.com/hli1221/CrossFuse

Paper Structure (24 sections, 12 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Correlation and uncorrelation for self-attention in the image fusion task. For multimodal images, self-attention may not be suitable for inter-modality processing: in image fusion, it enhances redundant information and reduces complementary features, an effect that is more obvious when all source images are grayscale.
  • Figure 2: The framework of CrossFuse. The two "Encoder" blocks share the same architecture but have different parameters. The cross-attention mechanism (CAM) is used to fuse the multimodal features. "SAB" indicates the self-attention block. The fused image is obtained by the "Decoder", which receives the long connection from the encoders.
  • Figure 3: The encoder architecture, which contains three blocks: "Conv", "MaxPooling" and "DenseBlock". "Conv" indicates one convolutional layer; "DenseBlock" includes four convolutional layers with dense connections.
  • Figure 4: The architecture of the cross-attention mechanism. "SA" follows the standard transformer architecture and contains one self-attention block. "Shift" and "unshift" denote the block shift and shift-back operations. "CA" indicates the novel cross-attention mechanism, which focuses on uncorrelated information.
  • Figure 5: The activation function curves of $\text{softmax}(\cdot)$ and $\text{re-softmax}(\cdot)$.
  • ...and 11 more figures
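
The captions of Figures 4 and 5 contrast a standard softmax with a re-softmax that steers cross attention toward uncorrelated content. The sketch below illustrates the idea on toy data. It assumes that re-softmax is softmax applied to negated scores; that definition, the helper names (`re_softmax`, `cross_attention_weights`) and the toy feature vectors are illustrative assumptions, not taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_softmax(scores):
    """Assumed reversed softmax: softmax of the negated scores,
    so the least correlated key receives the largest weight."""
    return softmax([-s for s in scores])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention_weights(query, keys, weight_fn):
    """Weights of one query token (modality A) over key tokens (modality B),
    using scaled dot-product scores."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    return weight_fn(scores)


# Hypothetical toy features: one infrared query, three visible keys.
ir_query = [1.0, 0.0]
vis_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

corr = cross_attention_weights(ir_query, vis_keys, softmax)
comp = cross_attention_weights(ir_query, vis_keys, re_softmax)

print("softmax weights:   ", [round(w, 3) for w in corr])
print("re-softmax weights:", [round(w, 3) for w in comp])
```

With the standard softmax, the key most similar to the query dominates; with the assumed re-softmax, the ordering is reversed and the least similar key dominates, which matches the stated goal of emphasizing complementary rather than redundant features.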