Visible and Infrared Image Fusion Using Encoder-Decoder Network

Ferhat Can Ataman, Gözde Bozdaği Akar

TL;DR

The paper addresses infrared-visible image fusion by introducing a lightweight, end-to-end encoder–decoder network composed of dual encoders, a fusion module, and a decoder. Fusion is achieved through per-layer channel-wise aggregation using 1x1 convolutions, with skip connections in the decoder to reconstruct the fused image. A no-reference loss based on PaQ-derived quality metrics, supplemented by MSE terms, guides training, enabling high perceptual quality without ground-truth fused images. Empirical results show state-of-the-art perceptual quality with significantly faster inference (≈178 FPS), making the method suitable for real-time embedded vision tasks. The work demonstrates practical impact for applications like object detection and tracking in multimodal scenes while highlighting potential for deployment on edge devices and future dataset expansion.
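The fusion step described above can be illustrated with a small sketch. It assumes that, at a given layer, the infrared and visible feature maps are concatenated along the channel axis and then mixed by a 1x1 convolution, which amounts to a per-pixel weighted sum across channels. The function names, shapes, and weights below are hypothetical and chosen only for illustration; they are not taken from the authors' code, and a practical implementation would use a deep learning framework rather than plain Python lists.

```python
# Illustrative sketch only: channel-wise fusion with a 1x1 convolution.
# Names, shapes, and weights are assumptions, not the authors' implementation.

def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution to a feature map.

    features: list of C_in channels, each an H x W list of lists.
    weights:  C_out x C_in matrix (one row per output channel).
    bias:     list of C_out values.
    """
    height, width = len(features[0]), len(features[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        # Each output pixel is a weighted sum of the same pixel in every input channel.
        channel = [
            [
                b + sum(w * feat[y][x] for w, feat in zip(w_row, features))
                for x in range(width)
            ]
            for y in range(height)
        ]
        out.append(channel)
    return out


def fuse(ir_features, vis_features, weights, bias):
    """Concatenate IR and visible features along channels, then mix them with a 1x1 conv."""
    stacked = ir_features + vis_features  # channel-wise concatenation (assumed)
    return conv1x1(stacked, weights, bias)


# One channel per modality, 2x2 spatial size.
ir = [[[1.0, 2.0], [3.0, 4.0]]]
vis = [[[10.0, 20.0], [30.0, 40.0]]]

# A single output channel that averages the two inputs (hypothetical weights).
fused = fuse(ir, vis, weights=[[0.5, 0.5]], bias=[0.0])
print(fused)
```

In a trained network the mixing weights are learned, so each layer can decide how much each modality contributes to every fused channel.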

Abstract

The aim of multispectral image fusion is to combine object or scene features of images with different spectral characteristics to increase the perceptual quality. In this paper, we present a novel learning-based solution to the image fusion problem, focusing on infrared and visible spectrum images. The proposed solution utilizes only convolution and pooling layers together with a loss function using no-reference quality metrics. The analysis is performed qualitatively and quantitatively on various datasets. The results show better performance than state-of-the-art methods. Also, the size of our network enables real-time performance on embedded devices. The project code is available at https://github.com/ferhatcan/pyFusionSR.

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The proposed network architecture. The first part includes two identical encoder networks that correspond to the infrared and visible images, respectively. The extracted features are then fused using 1x1 convolutional layers (green boxes in the middle). The fused features are used to construct the final image in the decoder network.
  • Figure 2: Quantitative comparison of several methods according to PaQ-2-PiQ scores and Qw scores. Higher values are better for both metrics. 21 test image pairs from the TNO dataset are used.
  • Figure 3: Qualitative comparison of several methods. From left to right, the methods are GFF, HMSD_GF, Hybrid_MSD, TIF, and the proposed method. The image sequences in the first three rows are carWhite, fight, and labMan. The remaining rows show zoomed-in views of the areas marked by red rectangles in each sequence.
  • Figure 4: Qualitative comparison of several methods. From left to right, the methods are DLF, Hybrid_MSD, Dual_Branch, Rfn-Nest, DeepFuse, and the proposed method. The image sequences are taken from the TNO dataset.