Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws
Lin Guo, Xiaoqing Luo, Wei Xie, Zhancheng Zhang, Hui Li, Rui Wang, Zhenhua Feng, Xiaoning Song
TL;DR
This work tackles infrared–visible image fusion by reframing it through human cognitive principles and probabilistic reasoning. It introduces HCLFuse, a diffusion-based framework that couples an optimal-transport–driven alignment with a multi-scale variational bottleneck encoder and a physics-guided diffusion process, enabling more interpretable and structurally consistent fusion under uncertainty. The approach yields state-of-the-art results on multiple benchmarks and improves downstream semantic segmentation, while providing formal guarantees via information-theoretic bounds and physically informed constraints. Although powerful, the method relies on well-aligned modal pairs and incurs diffusion-related computational overhead, highlighting a trade-off between quality and practicality in real-time settings.
Abstract
Existing infrared and visible image fusion methods often struggle to balance the information contributed by each modality. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in how modal information is selected undermines the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws to form a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages. This mechanism enhances the model's ability to perceive the intrinsic structure of the data and reduces its dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. These results demonstrate the advantages of a generative image fusion method inspired by human cognition in enhancing structural consistency and detail quality.
