
BusReF: Infrared-Visible images registration and fusion focus on reconstructible area using one set of features

Zeyang Zhang, Hui Li, Tianyang Xu, Xiaojun Wu, Josef Kittler

TL;DR

BusReF addresses the challenge of misaligned infrared-visible image fusion by unifying registration and fusion in a bus-like framework built on a shared Auto-Encoder backbone. It introduces a reconstructible mask to constrain training to regions that can be recovered under simulated affine and elastic deformations, and a gradient-aware fusion network to guide registration toward preserving edge congruence. The registration component learns the affine parameters $\theta$ and the deformation field $\phi$ within a joint loss $\mathcal{L}_{reg} = \epsilon\mathcal{L}_{MNCC} + \mathcal{L}_{MG}$, while fusion optimizes a weighted SSIM loss plus a gradient loss, $\mathcal{L}_{fuse} = \sigma\mathcal{L}_{wsim} + \mathcal{L}_{grad}$. Experiments on RoadScene and NIRScene show BusReF achieves higher reconstructible-region registration quality (e.g., $MNCC$, $MMSE$) and competitive fusion quality, yielding robust IVRF under unaligned inputs with practical implications for downstream tasks.
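To make the role of the reconstructible mask concrete, here is a minimal NumPy sketch of mask-restricted similarity scores. It assumes that $MNCC$ and $MMSE$ denote normalized cross-correlation and mean squared error evaluated only inside the mask; the summary does not define them, so this reading, the function names `masked_ncc` and `masked_mse`, and the toy data are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def masked_ncc(a, b, mask, eps=1e-8):
    """Normalized cross-correlation over pixels where mask is True (assumed reading of MNCC)."""
    m = mask.astype(bool)
    x = a[m].astype(np.float64)
    y = b[m].astype(np.float64)
    x = x - x.mean()  # centre each signal inside the mask only
    y = y - y.mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum()) + eps
    return float((x * y).sum() / denom)

def masked_mse(a, b, mask):
    """Mean squared error over pixels where mask is True (assumed reading of MMSE)."""
    m = mask.astype(bool)
    d = a[m].astype(np.float64) - b[m].astype(np.float64)
    return float((d * d).mean())

# Toy example: the left 8 columns of `moving` were lost during warping,
# so they are excluded by the (hypothetical) reconstructible mask.
rng = np.random.default_rng(0)
fixed = rng.random((32, 32))
moving = fixed.copy()
moving[:, :8] = 0.0
mask = np.ones((32, 32), dtype=bool)
mask[:, :8] = False

score_masked = masked_ncc(fixed, moving, mask)                 # close to 1
score_unmasked = masked_ncc(fixed, moving, np.ones_like(mask))  # noticeably lower
error_masked = masked_mse(fixed, moving, mask)                  # exactly 0
```

Restricting the score to the mask keeps the lost border from penalising an otherwise perfect alignment, which is the motivation the summary gives for training and evaluating on reconstructible regions only.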

Abstract

In a scenario where multi-modal cameras are operating together, the problem of working with non-aligned images cannot be avoided. Yet, existing image fusion algorithms rely heavily on strictly registered input image pairs to produce more precise fusion results, as a way to improve the performance of downstream high-level vision tasks. In order to relax this assumption, one can attempt to register images first. However, the existing methods for registering multiple modalities have limitations, such as complex structures and reliance on significant semantic information. This paper aims to address the problem of image registration and fusion in a single framework, called BusReF. We focus on the Infrared-Visible image registration and fusion (IVRF) task. In this framework, the input unaligned image pairs pass through three stages: coarse registration, fine registration and fusion. It will be shown that the unified approach enables more robust IVRF. We also propose a novel training and evaluation strategy, involving the use of masks to reduce the influence of non-reconstructible regions on the loss functions, which greatly improves the accuracy and robustness of the fusion task. Last but not least, a gradient-aware fusion network is designed to preserve the complementary information. The advanced performance of this algorithm is demonstrated by

Paper Structure (12 sections, 13 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Comparison of frameworks: existing serial training methods versus our proposed bus-like training.
  • Figure 2: (a) Original image. The red mask marks the unreconstructible area; the green mask marks our reconstructible content. (b) Artificially generated unaligned image. (c) Ground truth registration results. (d) Possible artifacts when the reconstructible mask is not applied.
  • Figure 3: The Reconstructor is an Auto-Encoder framework that is trained by feeding in Infrared-Visible images simultaneously and requiring it to reconstruct them, which ensures that it can extract multi-modal features. The registration module is mounted on this pre-trained framework to ensure the acquisition of detailed features. The affine transformation parameter $\theta$ and the deformation field $\phi$, corresponding to the rigid and elastic transforms, are then learnt. Finally, mesh resampling is performed to achieve the image registration.
  • Figure 4: The architecture of GAF. The registered image pairs are fed to the feature extractor, and gradient sensing is performed on the extracted features. $G$ is the Laplacian operator. After the high-frequency and low-frequency parts are separated by Maxpool and Avgpool respectively, two MLPs are used to learn inter-modality weighting. Finally, the fused image is obtained by output convolutional layers and Tanh activation.
  • Figure 5: A qualitative comparison of fusion and registration on the TIR task (RoadScene dataset). (a), (b) The original visible image and thermal infrared image, respectively. (c) The directly fused results obtained by GAF. (d) SIFT+GAF. (e) LoFTR+GAF. (f), (g) The results obtained by MURF and SemLA. (h) The results of BusReF.
  • ...and 2 more figures
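
The captions of Figures 2 and 3 mention resampling an image on a transformed mesh and masking out areas that cannot be reconstructed. The sketch below illustrates one plausible way to obtain both from an affine matrix $\theta$: output pixels whose sampling location falls outside the source image are marked as non-reconstructible. The normalized $[-1, 1]$ coordinate convention, the nearest-neighbour sampling, and the names `affine_grid` and `warp_with_mask` are assumptions made for illustration; the paper's actual resampling and mask construction may differ, and the elastic field $\phi$ (which would add per-pixel offsets to the same grid) is not modelled here.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Source (x, y) sampling coordinates in normalized [-1, 1] units for every output pixel."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3) homogeneous
    src = coords @ theta.T                                   # theta is 2x3 -> (H, W, 2)
    return src[..., 0], src[..., 1]

def warp_with_mask(image, theta):
    """Nearest-neighbour affine warp that also returns a validity mask.

    The mask is True where the output pixel was sampled from inside the
    source image, i.e. where the content can be reconstructed.
    """
    h, w = image.shape
    sx, sy = affine_grid(theta, h, w)
    # Convert normalized coordinates to integer pixel indices.
    ix = np.rint((sx + 1.0) * (w - 1) / 2.0).astype(int)
    iy = np.rint((sy + 1.0) * (h - 1) / 2.0).astype(int)
    valid = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    out = np.zeros_like(image)
    out[valid] = image[iy[valid], ix[valid]]
    return out, valid

# Toy example: sample 0.5 normalized units to the right, which is
# 2 pixels for a 9-pixel-wide image, so the last 2 output columns are lost.
image = np.arange(45, dtype=np.float64).reshape(5, 9)
theta_shift = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.0]])
warped, reconstructible = warp_with_mask(image, theta_shift)
```

A mask produced this way can be passed to any loss or metric that should ignore the lost border, so that the registration network is not penalised for content that no transform could recover.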