Bridging Human Evaluation to Infrared and Visible Image Fusion

Jinyuan Liu, Xingyuan Li, Qingyun Mei, Haoyuan Xu, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan

TL;DR

This work introduces the first large-scale human feedback dataset for infrared and visible image fusion (IVIF), containing multidimensional subjective scores and artifact annotations and enriched by a fine-tuned large language model with expert review. It also designs a domain-specific reward function and trains a reward model to quantify perceptual quality.

Abstract

Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune the fusion network through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics. Code is available at https://github.com/ALKA-Wind/EVAFusion.
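
As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes reward scores within one group of candidate fused images, so that each candidate is judged relative to its siblings rather than on an absolute scale. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `group_relative_advantages`, the `eps` parameter, the use of the population standard deviation, and the example scores are all hypothetical, and the paper's reward function and policy update are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of one group of candidates.

    Assumption: population standard deviation is used; implementations
    may instead use the sample standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four fused candidates
# generated from the same infrared/visible image pair.
scores = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(scores)
```

Candidates scoring above the group mean receive positive advantages and those below receive negative ones, which is what lets a policy update favor the outputs the reward model prefers within each group.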

Paper Structure (15 sections, 5 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Illustration of our pipeline. The fine-tuning module is based on RLHF, integrating the ViT-based fusion-oriented reward model trained with the Human Feedback Dataset. A segmentation-assisted mechanism, based on GRPO, is introduced for fine-tuning, aiming to improve the quality of the fused images.
  • Figure 2: Workflow of the dataset collection. Source images are processed and clustered to select representative samples, which experts then screen. Fusion models fuse the $I_{vis}$ and $I_{inf}$ images, and experts score and annotate a subset as prior knowledge for aligning the GPT model. After GPT annotates all images, experts review the results, yielding the human feedback IVIF dataset.
  • Figure 3: Overview of our collected dataset. (a) Data diversity: samples from multiple benchmark datasets; (b) Label diversity: each sample contains fine-grained scores across four quality dimensions and artifact heatmaps; (c) Scene diversity: covering key semantic categories including people, cars, buildings, roads, and more.
  • Figure 4: Qualitative comparison of our method with existing image fusion methods. From top to bottom: RoadScene, M$^{3}$FD, TNO.
  • Figure 5: Preference ranking heatmaps on three datasets. Yellow represents the most preferred, and purple represents the least preferred.
  • ...and 4 more figures