Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion
Dan He, Guofen Wang, Weisheng Li, Yucheng Shu, Wenbo Li, Lijian Yang, Yuping Huang, Feiyan Li
TL;DR
This work rethinks multimodal image fusion by targeting fundamental architectural components rather than solely fusion rules. It introduces LKC-FUNet, a UNet-based model that uses a hybrid of instance and group normalization (IN+GN) to preserve sparse, modality-specific details and large-kernel convolutions to expand receptive fields, complemented by a multi-path adaptive fusion module for cross-scale feature integration. The method achieves state-of-the-art results across medical (MRI-CT, MRI-PET, MRI-SPECT) and infrared-visible fusion benchmarks, while also showing clear benefits for downstream segmentation tasks. The approach highlights the critical interplay between normalization and convolution in multimodal fusion and points to practical directions for efficiency improvements on edge devices.
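As a rough illustration of why kernel size matters for the receptive field, the standard recurrence for stacked convolutions can be computed directly. The sketch below is not taken from the LKC-FUNet code: the function name `receptive_field` and the kernel sizes (3 and 13) are assumptions chosen only for illustration, and dilation is assumed to be 1.

```python
def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of stacked convolutions.

    `layers` is a sequence of (kernel_size, stride) pairs applied in order.
    Dilation is assumed to be 1 for every layer.
    """
    rf = 1    # with no convolution, one output pixel sees one input pixel
    jump = 1  # spacing, in input pixels, between neighbouring outputs so far
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical configurations, not the paper's actual settings.
small_stack = receptive_field([(3, 1)] * 3)  # three 3x3 convs, stride 1
single_large = receptive_field([(13, 1)])    # one 13x13 conv, stride 1
```

Under these assumptions, a single large kernel covers in one layer a span that would otherwise take several stacked small kernels, which is the sense in which large kernels expand the receptive field.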
Abstract
Multimodal image fusion (MMIF) integrates information from different modalities into a single comprehensive image that aids downstream tasks. However, existing research focuses on complementary information fusion and training strategies, overlooking the critical role of underlying architectural components such as normalization and convolution kernels. We reevaluate the UNet architecture for end-to-end MMIF and find that the widely used batch normalization limits performance by smoothing crucial sparse features. To address this, we propose a hybrid of instance and group normalization that maintains sample independence and reinforces intrinsic feature correlations. Crucially, this strategy yields richer feature maps, which lets large-kernel convolution fully exploit its receptive field and improves detail preservation. Furthermore, the proposed multi-path adaptive fusion module dynamically calibrates features from varying scales and receptive fields, ensuring effective information transfer. Our method achieves state-of-the-art objective performance on the MSRS, M$^3$FD, TNO, and Harvard datasets, producing visually clearer salient objects and lesion areas. Notably, it improves MSRS segmentation mIoU by 8.1\% over using the infrared image alone. This performance stems from a synergistic design of normalization and convolution kernels, which preserves critical sparse features. The code is available at https://github.com/HeDan-11/LKC-FUNet.
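The sample-independence argument can be illustrated with a toy example. The sketch below is a minimal pure-Python version of the three normalization schemes, without learnable scale and shift, and is not the authors' implementation. All function names and the toy feature values are assumptions, and the way the paper combines instance and group normalization is not specified here, so the two are shown separately.

```python
EPS = 1e-5  # small constant for numerical stability


def _standardize(values):
    """Scale a flat list of floats to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + EPS) ** 0.5 for v in values]


def _split(flat, size):
    """Cut a flat list back into consecutive chunks of length `size`."""
    return [flat[i:i + size] for i in range(0, len(flat), size)]


def batch_norm(batch):
    """Statistics per channel, pooled over every sample in the batch."""
    n_channels, size = len(batch[0]), len(batch[0][0])
    per_channel = []
    for c in range(n_channels):
        pooled = [v for sample in batch for v in sample[c]]
        per_channel.append(_split(_standardize(pooled), size))
    # Reorder from [channel][sample] back to [sample][channel].
    return [[per_channel[c][n] for c in range(n_channels)]
            for n in range(len(batch))]


def instance_norm(batch):
    """Statistics per sample and per channel."""
    return [[_standardize(channel) for channel in sample] for sample in batch]


def group_norm(batch, num_groups):
    """Statistics per sample, pooled over each group of channels."""
    out = []
    for sample in batch:
        per_group, size = len(sample) // num_groups, len(sample[0])
        normed = []
        for g in range(num_groups):
            channels = sample[g * per_group:(g + 1) * per_group]
            pooled = [v for channel in channels for v in channel]
            normed.extend(_split(_standardize(pooled), size))
        out.append(normed)
    return out


def spread(values):
    """Gap between the largest and smallest value."""
    return max(values) - min(values)


# Toy features: batch[sample][channel] is a flat list of spatial values.
sparse = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 2.0, 0.0]]  # one bright pixel each
dense_a = [[5.0, 6.0, 7.0, 8.0], [1.0, 3.0, 5.0, 7.0]]
dense_b = [[50.0, 60.0, 70.0, 80.0], [10.0, 30.0, 50.0, 70.0]]
batch_1, batch_2 = [sparse, dense_a], [sparse, dense_b]

# Does the sparse sample's output change when its batch neighbour changes?
in_same = instance_norm(batch_1)[0] == instance_norm(batch_2)[0]
gn_same = group_norm(batch_1, 1)[0] == group_norm(batch_2, 1)[0]
bn_same = batch_norm(batch_1)[0] == batch_norm(batch_2)[0]

# How much contrast does the bright pixel keep in the sparse sample?
bn_spread = spread(batch_norm(batch_1)[0][0])
in_spread = spread(instance_norm(batch_1)[0][0])
```

In this toy setting, the sparse sample's output under instance and group normalization does not depend on what else is in the batch, whereas under batch normalization it does, and its single bright value is compressed by the pooled variance. This is one plausible reading of the smoothing effect described above, not a reproduction of the paper's analysis.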
