Infrared and Visible Image Fusion with Hierarchical Human Perception

Guang Yang; Jie Li; Xin Liu; Zhusi Zhong; Xinbo Gao

TL;DR

This work introduces an image fusion method, Hierarchical Perception Fusion (HPFusion), which leverages a Large Vision-Language Model to incorporate hierarchical human semantic priors, preserving complementary information that satisfies the human visual system.

Abstract

Image fusion combines images from multiple domains into a single image that contains complementary information from the source domains. Existing methods use pixel intensity, texture, and high-level vision task information as the criteria for deciding which information to preserve, and lack enhancement for human perception. We introduce an image fusion method, Hierarchical Perception Fusion (HPFusion), which leverages a Large Vision-Language Model to incorporate hierarchical human semantic priors, preserving complementary information that satisfies the human visual system. We propose multiple questions that humans focus on when viewing an image pair, and the Large Vision-Language Model generates answers to them from the images. The answer texts are encoded into the fusion network, and the optimization also guides the human semantic distribution of the fused image to be more similar to that of the source images, exploring complementary information within the human perception domain. Extensive experiments demonstrate that HPFusion achieves high-quality fusion results in both information preservation and human visual enhancement.
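
To make the idea of matching answers in a "human perception domain" more concrete, here is a minimal toy sketch. It is not the paper's loss or architecture: the question list, the hashed bag-of-words `embed_text` encoder, and the `semantic_consistency_loss` function are all hypothetical stand-ins, and the answer strings would in practice come from a Large Vision-Language Model rather than being typed by hand. The sketch only illustrates comparing per-question answers for a fused image against answers for its source images.

```python
import math

# Hypothetical questions; the paper's real question set is shown in its Figure 1.
QUESTIONS = [
    "What is the overall scene?",
    "Which objects stand out?",
]


def embed_text(text, dim=16):
    """Toy stand-in for a text encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic bucket index (Python's built-in hash() is salted per run).
        idx = sum(ord(c) for c in word) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def semantic_consistency_loss(fused_answers, source_answers):
    """Mean (1 - cosine similarity) over per-question answer embeddings.

    Assumed illustrative form only; the paper defines its own loss.
    """
    if len(fused_answers) != len(source_answers) or not fused_answers:
        raise ValueError("need one answer per question for both inputs")
    losses = [
        1.0 - cosine(embed_text(f), embed_text(s))
        for f, s in zip(fused_answers, source_answers)
    ]
    return sum(losses) / len(losses)


# Hand-written placeholder answers, one per entry in QUESTIONS.
source = ["a person walks on the road", "a person and a car"]
fused_good = ["a person walks on the road", "a person and a car"]
fused_bad = ["two cars parked at night", "a building"]

loss_good = semantic_consistency_loss(fused_good, source)
loss_bad = semantic_consistency_loss(fused_bad, source)
```

In this toy setting, a fused image whose answers match those of the source images gets a loss near zero, while mismatched answers give a larger value. A real system would swap the stub encoder for a learned text encoder and minimise such a term jointly with pixel-level objectives.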

Paper Structure (14 sections, 6 equations, 4 figures, 1 table)

Introduction
Related Work
Infrared and Visible Image Fusion
Large Visual-Language Model
Method
Overview
Architecture
Human Perception Module
Loss Function
Experiments
Qualitative Experiments
Quantitative Experiments
Ablation Studies
Conclusion

Figures (4)

Figure 1: Questions that humans tend to ask when viewing the infrared and visible image pair and corresponding answers generated by LLaVA.
Figure 2: The overall architecture of our fusion network, consisting of Human Perception Module, Cross-attention Block and Fusion Network.
Figure 3: Architecture of the Human Perception Module.
Figure 4: Qualitative comparison of our method with six state-of-the-art models on five infrared and visible image pairs from the $M^3FD$ dataset. The first and second columns are the infrared and visible images, respectively. The third to ninth columns show the fused images produced by the compared methods.
