Rethinking the Evaluation of Visible and Infrared Image Fusion

Dayan Guan, Yixuan Wu, Tianzhu Liu, Alex C. Kot, Yanfeng Gu

TL;DR

This work tackles the evaluation bottleneck of Visible-Infrared Image Fusion (VIF) by introducing a Segmentation-oriented Evaluation Approach (SEA) that uses universal segmentation models to assess fused images without requiring ground-truth fused images. SEA takes the fused images produced by a VIF method, predicts semantic segmentation with models such as X-Decoder, SEEM, and G-SAM, and computes mean IoU ($mIoU$) against labeled segmentation maps, enabling cross-dataset and cross-method comparisons. In extensive experiments on the FMB and MVSeg datasets, SEA shows that many recent VIF methods do not outperform simply using visible images, even though nearly half of the infrared images score better than their visible counterparts; the study also identifies $Q_{\text{ABF}}$ and $Q_{\text{VIFF}}$ as the conventional metrics most correlated with SEA, offering practical proxies when labels are unavailable. The results call for a reorientation of VIF research toward semantically consistent fusion and provide a scalable, dataset-agnostic evaluation framework to guide future method development. The work contributes (1) a universally applicable SEA framework, (2) a comprehensive comparative study of 30 recent VIF methods across two large datasets, and (3) a correlation analysis linking SEA to traditional metrics to inform proxy-based evaluation.
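The mIoU score at the core of SEA can be sketched as follows. This is a minimal NumPy illustration, not the authors' released code: the function names `confusion_matrix` and `mean_iou`, the `ignore_index=255` convention, and the toy arrays are assumptions made for the example. In practice `pred` would be the output of a universal segmentation model applied to a fused image.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes, ignore_index=255):
    """Count (label, prediction) pairs; rows are labels, columns are predictions.

    Assumes every prediction is an integer in [0, num_classes) and that
    pixels labeled `ignore_index` (an assumed convention) are skipped.
    """
    pred = np.asarray(pred).ravel()
    label = np.asarray(label).ravel()
    valid = label != ignore_index
    idx = label[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(conf):
    """Average per-class IoU over classes that occur in the label or prediction."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())

# Toy 2x3 label map and prediction (hypothetical values, three classes).
label = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])

conf = confusion_matrix(pred, label, num_classes=3)
miou = mean_iou(conf)
print(f"mIoU = {miou:.4f}")
```

For a whole dataset, the confusion matrices of all images would typically be summed before calling `mean_iou`, so that rare classes are not averaged per image.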

Abstract

Visible and Infrared Image Fusion (VIF) has garnered significant interest across a wide range of high-level vision tasks, such as object detection and semantic segmentation. However, the evaluation of VIF methods remains challenging due to the absence of ground truth. This paper proposes a Segmentation-oriented Evaluation Approach (SEA) to assess VIF methods by incorporating the semantic segmentation task and leveraging the segmentation labels available in the latest VIF datasets. Specifically, SEA utilizes universal segmentation models, capable of handling diverse images and classes, to predict segmentation outputs from fused images and compare these outputs with the segmentation labels. Our evaluation of recent VIF methods using SEA reveals that their performance is comparable or even inferior to using visible images only, despite nearly half of the infrared images demonstrating better performance than visible images. Further analysis indicates that, among conventional VIF evaluation metrics, the two most correlated with SEA are the gradient-based fusion metric $Q_{\text{ABF}}$ and the visual information fidelity metric $Q_{\text{VIFF}}$, which can serve as proxies when segmentation labels are unavailable. We hope that our evaluation will guide the development of novel and practical VIF methods. The code has been released at https://github.com/Yixuan-2002/SEA/.
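The idea of ranking conventional metrics by their correlation with SEA can be illustrated with a short standard-library sketch. The text above does not say which correlation coefficient the authors use, so Spearman rank correlation is an assumption here, and all scores below are hypothetical placeholders rather than results from the paper.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-method scores (placeholders, NOT values from the paper).
sea_miou = [0.41, 0.45, 0.39, 0.50, 0.47]   # SEA score of five fusion methods
metric_a = [0.52, 0.60, 0.48, 0.71, 0.66]   # a metric that orders methods like SEA
metric_b = [7.1, 6.2, 6.8, 6.9, 6.0]        # a metric that orders them differently

rho_a = spearman(sea_miou, metric_a)
rho_b = spearman(sea_miou, metric_b)
print(f"rho_a = {rho_a:.2f}, rho_b = {rho_b:.2f}")
```

A metric whose rank correlation with SEA is close to 1 orders the fusion methods almost as SEA does, which is what makes it usable as a proxy when segmentation labels are missing.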

Paper Structure

This paper contains 7 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Evaluating the quality of fused images in VIF poses a significant challenge due to the lack of ground truth. To address this challenge, this paper proposes a novel segmentation-oriented evaluation approach that leverages a semantic segmentation task for assessing the quality of fused images. The underlying rationale is that better segmentation performance indicates better fusion quality, owing to the intrinsic consistency between visual and semantic information [zhang2024mrfs]. To illustrate, the first row shows the visible and infrared images together with the fused images generated by the latest VIF methods TIM [liu2024task] and SDCFusion [liu2024semantic], while the second row presents the corresponding segmentation label and the outputs (from TIM and SDCFusion) predicted by the state-of-the-art universal segmentation model X-Decoder [zou2023generalized], with the last row showing the color palette for the different classes.
  • Figure 2: Overview of current universal segmentation models featuring an image/text encoder-decoder architecture. The encoders are designed to process diverse image inputs (across various styles and modalities) and text inputs (including different class names or queries). The decoder is capable of performing multiple high-level vision tasks, such as semantic segmentation, instance segmentation, and referring segmentation. Our proposed SEA leverages the semantic segmentation capabilities of current universal segmentation models to evaluate the quality of VIF fused images.
  • Figure 3: Image quality assessment of the latest VIF methods (TIM [liu2024task] and SDCFusion [liu2024semantic]) using our proposed SEA alongside three widely used evaluation metrics: Entropy (EN), Standard Deviation (SD), and SSIM. Our SEA demonstrates superior performance on both the FMB and MVSeg datasets. Note that indicates better performance, while indicates worse performance.
  • Figure 4: Comparison of the segmentation frameworks used in non-unified/unified VIF methods and in our proposed SEA. Non-unified VIF methods train separate models for image fusion and segmentation, using the latter for subsequent evaluation. Conversely, unified VIF methods integrate the training of the fusion and segmentation models, employing a segmentation loss to refine the fusion process. In comparison, our proposed SEA eliminates the need for additional training of the segmentation model.
  • Figure 5: Performance differences between infrared and visible images. The mIoU difference is computed by subtracting the performance of each infrared image from that of its corresponding visible image across the FMB and MVSeg datasets. Green bars indicate cases where infrared images outperform visible images, while blue bars represent the opposite scenario.
  • ...and 3 more figures
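
Two of the conventional metrics named in the Figure 3 caption, Entropy (EN) and Standard Deviation (SD), have simple standard definitions that can be sketched as follows. This is a minimal NumPy illustration assuming 8-bit grayscale inputs; the function names and the toy images are hypothetical, and SSIM is omitted because it needs a reference image and a windowed computation.

```python
import numpy as np

def entropy(img):
    """Shannon entropy, in bits, of the 256-bin intensity histogram (EN)."""
    pixels = np.asarray(img, dtype=np.uint8).ravel()
    hist = np.bincount(pixels, minlength=256)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """Standard deviation of pixel intensities (SD), a rough contrast measure."""
    return float(np.asarray(img, dtype=np.float64).std())

# Toy images (hypothetical): a constant patch and a two-level checkerboard.
flat = np.full((4, 4), 128, dtype=np.uint8)
checker = np.array([[0, 255], [255, 0]], dtype=np.uint8)

for name, img in (("flat", flat), ("checker", checker)):
    print(name, entropy(img), standard_deviation(img))
```

Both metrics depend only on the intensity distribution of the fused image, not on whether its content is semantically correct, which is the kind of gap a segmentation-based score is meant to expose.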