
EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment

Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Tao Zhou, Hui Li, Zhangyong Tang, Josef Kittler

Abstract

Evaluation is essential in image fusion research, yet most existing metrics are borrowed directly from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of fusion results but are also computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed to efficiently approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess the similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model then measures the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model with perceptual scene assessments provided by a large language model. Finally, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream task performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at https://github.com/AWCXV/EvaNet.
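
To make the divide-and-conquer idea concrete, the following is a minimal PyTorch-style sketch of the evaluation flow described above: the fused image is decomposed into infrared- and visible-related components, each component is scored by a lightweight surrogate branch against its source image, and the two scores are blended by an environment-dependent weight. All names here (SurrogateMetricHead, evaluate_fusion, decompose) are hypothetical illustrations under these assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SurrogateMetricHead(nn.Module):
    """Hypothetical lightweight branch that maps a decomposed component
    and its source image to a scalar information-preservation score."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(channels, 1)

    def forward(self, component: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        x = torch.cat([component, source], dim=1)   # (B, 2, H, W)
        feat = self.encoder(x).flatten(1)           # (B, C)
        return self.regressor(feat).squeeze(1)      # (B,)


def evaluate_fusion(fused, ir, vis, decompose, ir_head, vis_head, env_weight):
    """Divide-and-conquer evaluation sketch:
    1) split the fused image into IR- and visible-related components,
    2) score how well each component preserves its source,
    3) blend the two scores with an environment-dependent weight."""
    comp_ir, comp_vis = decompose(fused)            # hypothetical decomposition step
    score_ir = ir_head(comp_ir, ir)
    score_vis = vis_head(comp_vis, vis)
    return env_weight * score_ir + (1.0 - env_weight) * score_vis
```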

Paper Structure

This paper contains 28 sections, 12 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: An illustration of the consistency and efficiency issues (sub-figures (a) and (b)) of existing image fusion metrics. Image fusion evaluation relies heavily on traditional signal processing techniques (e.g., the discrete cosine transform (DCT) and wavelet transform) or metrics adapted from other vision tasks. Without appropriate adjustment, these metrics lack consistency, i.e., better visualisation does not always correspond to higher metric values. Moreover, the evaluation phase in current image fusion settings incurs high processing time costs. The proposed EvaNet addresses these issues effectively by significantly reducing the evaluation time and enhancing consistency.
  • Figure 2: The speed (milliseconds per image) imbalance between inference and evaluation in image fusion. Traditional metrics rely on separate, computationally intensive procedures to perform different assessments, significantly slowing down the evaluation phase. In contrast, the proposed EvaNet generates multiple evaluation results simultaneously within a single forward pass, offering acceleration by a factor of up to 1,000 and, as will be demonstrated, considerably improved metric consistency.
  • Figure 3: Overview of the proposed EvaNet framework. Our method replaces traditional image fusion assessment processes with a lightweight learning-based network to significantly improve evaluation efficiency. In addition, a divide-and-conquer strategy is used to disentangle and independently measure the information preserved from each source modality. The environment branch, as part of the three-branch design, introduces an adaptive penalty mechanism to mitigate modality imbalance in challenging fusion scenarios.
  • Figure 4: The network architecture of the proposed EvaNet (see the network architecture section). The model consists of two main components. The left part shows the modality-specific decomposition, implemented using two lightweight information probes [cheng2025fusionbooster]. The right part illustrates the surrogate metric prediction process, consisting of three branches: two modality branches correspond to the infrared and visible inputs, while the environment branch, guided by a large language model (LLM), predicts an adaptive weighting factor ($ENV$) to eliminate any modality imbalance during evaluation (a minimal illustrative sketch of this weighting is given after the figure list).
  • Figure 5: An illustration of perceptual scene environment assessment. Left: examples showing that the LLVIP-train dataset contains multiple image pairs with limited scene diversity, where foreground objects vary but the background remains consistent. Right: the environment label generation process using a Generative Pre-trained Transformer (GPT) [OpenAI2025ChatGPT-4o], which estimates scene conditions such as illumination and obscuration to guide the training of the environment branch in EvaNet.
  • ...and 15 more figures
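
The adaptive weighting described in the Figure 4 caption can be pictured with a small sketch: an environment branch maps the source pair to a scalar $ENV$ in (0, 1) that rebalances the infrared and visible scores. The module below is a hypothetical illustration under that assumption; the name EnvironmentBranch and the layer sizes are not taken from the paper.

```python
import torch
import torch.nn as nn

class EnvironmentBranch(nn.Module):
    """Hypothetical environment branch: predicts an adaptive weighting
    factor ENV in (0, 1) from the concatenated source pair, so that
    difficult scenes (e.g., low light) can shift weight between modalities."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),  # squash to (0, 1) so it can act as a blending weight
        )

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # Concatenate the infrared and visible inputs along the channel axis
        return self.net(torch.cat([ir, vis], dim=1)).squeeze(1)  # (B,)
```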