OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li

TL;DR

This work proposes OSDM-MReg, a novel multimodal image registration framework that bridges the modality gap through image-to-image translation, and introduces a one-step unaligned target-guided conditional diffusion model (UTGOS-CDM) to translate source and target images into a unified representation domain.

Abstract

Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, existing methods often struggle to extract modality-invariant features when faced with large nonlinear radiometric differences, such as those between SAR and optical images. To address these challenges, we propose OSDM-MReg, a novel multimodal image registration framework that bridges the modality gap through image-to-image translation. Specifically, we introduce a one-step unaligned target-guided conditional diffusion model (UTGOS-CDM) to translate source and target images into a unified representation domain. Unlike traditional conditional DDPMs, which require hundreds of iterative steps for inference, our model incorporates a novel inverse translation objective during training to enable direct prediction of the translated image in a single step at test time, significantly accelerating the registration process. After translation, we design a multimodal multiscale registration network (MM-Reg) that extracts and fuses features from both the unimodal and the translated multimodal image pairs using the proposed multimodal fusion strategy, enhancing the robustness and precision of alignment across scales and modalities. Extensive experiments on the OSdataset demonstrate that OSDM-MReg achieves superior registration accuracy compared to state-of-the-art methods.
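The single-step prediction mentioned in the abstract can be related to the standard DDPM identity that recovers a clean sample from a noisy one once the noise is known or estimated. The sketch below is only an illustration of that generic identity, not the authors' UTGOS-CDM: the linear noise schedule, the 1000-step horizon, and the helper names `alpha_bar`, `add_noise`, and `one_step_x0` are assumptions introduced here, and the true noise stands in for the output of a trained denoising network.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t.

    Assumes a linear beta schedule (a common DDPM default, not confirmed
    by the source).
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, rng):
    """Forward process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return x_t, eps


def one_step_x0(x_t, eps_pred, t):
    """Invert the forward process in one step, given a noise estimate."""
    ab = alpha_bar(t)
    return [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
            for x, e in zip(x_t, eps_pred)]


rng = random.Random(0)
target = [0.1, -0.4, 0.8, 0.0]        # toy "image" as a flat list of pixels
noisy, eps = add_noise(target, t=500, rng=rng)

# With a perfect noise estimate the clean signal is recovered exactly
# (up to floating-point error); a trained network would supply eps_pred.
recovered = one_step_x0(noisy, eps, t=500)
```

In practice the quality of a one-step estimate depends entirely on how good the noise prediction is at the chosen time step, which is why the choice of time step matters for registration accuracy.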

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed OSDM-MReg framework. The source image $\bm{\mathrm{I}}^\mathrm{S}$ is first translated into the target domain via UTGOS-CDM, which employs a DDPM-based denoising network $\Psi_{\mathrm{ddpm}}$ and a reconstruction module $\Psi_{\mathrm{recon}}$ to generate $\bm{\mathrm{I}}^{\mathrm{S} \rightarrow \mathrm{T}}$. The unimodal pair $\{\bm{\mathrm{I}}^{\mathrm{S} \rightarrow \mathrm{T}}, \bm{\mathrm{I}}^\mathrm{T}\}$ is then used to estimate the initial corner displacement $\bm{\hat{\mathrm{D}}}_{\mathrm{q^u}}^\mathrm{u}$ via MM-Reg. Subsequently, the original multimodal pair $\{\bm{\mathrm{I}}^\mathrm{S}, \bm{\mathrm{I}}^\mathrm{T}\}$ is utilized to predict the final displacement $\bm{\hat{\mathrm{D}}}_{\mathrm{q^u}+\mathrm{q^m}}^\mathrm{m}$, guided by the initial estimate.
  • Figure 2: Overview of UTGOS-CDM. The model involves two forward and two reverse processes. Two noisy target images $\bm{\mathrm{I}}^\mathrm{T}_{t_1+1}$ and $\bm{\mathrm{I}}^\mathrm{T}_{t_2+1}$ are generated by adding Gaussian noise to $\bm{\mathrm{I}}^\mathrm{T}$. The first reverse process is conditioned on $\{\bm{\mathrm{H}}(\bm{\mathrm{I}}^\mathrm{S}), \bm{\mathrm{H}}^{-1}(\bm{\mathrm{I}}^\mathrm{T})\}$, and the second predicts the translated source image $\bm{\mathrm{I}}^{\mathrm{S} \rightarrow \mathrm{T}}$ via one-step reconstruction.
  • Figure 3: Training flow of MM-Reg. The framework contains two branches: (1) a unimodal branch with input $\{\bm{\mathrm{I}}^{\mathrm{S}\rightarrow \mathrm{T}}, \bm{\mathrm{I}}^\mathrm{T}\}$, and (2) a multimodal branch with input $\{\bm{\mathrm{I}}^\mathrm{S}, \bm{\mathrm{I}}^\mathrm{T}\}$. Both adopt multiscale iterative updates with 2 steps per scale, starting from $\bm{\hat{\mathrm{D}}}_0 = \bm{\mathrm{0}}$.
  • Figure 4: Test flowchart of the proposed OSDM-MReg. The unimodal prediction $\bm{\hat{\mathrm{D}}}^\mathrm{u}_{\mathrm{q_1^u+q_2^u+q_4^u+q_8^u}}$ is used as the initial estimation for the multimodal branch to produce the final prediction $\bm{\hat{\mathrm{D}}}^\mathrm{m}_{\mathrm{\sum + q_1^m+q_2^m+q_4^m+q_8^m}}$. Stage weights are set to $(2, 1, 0, 0)$ for the unimodal and $(0, 1, 2, 2)$ for the multimodal branch during testing.
  • Figure 5: Average corner error of OSDM-MReg on the validation dataset for time steps $t_\mathrm{t} = 200, 300, 400, 500, 600, 700, 800$.
  • ...and 1 more figure
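The captions above refer to corner displacements and to the average corner error without defining them. A minimal sketch follows, assuming the convention common in deep homography estimation: the error is the mean Euclidean distance between predicted and ground-truth displacements of the four image corners. The function name `average_corner_error` and the sample values are hypothetical and do not come from the paper.

```python
import math


def average_corner_error(pred_disp, true_disp):
    """Mean Euclidean distance between predicted and true corner displacements.

    Each argument is a sequence of (dx, dy) pairs, one per image corner.
    This definition is an assumed convention, not taken from the source.
    """
    if len(pred_disp) != len(true_disp) or not pred_disp:
        raise ValueError("expected two non-empty sequences of equal length")
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred_disp, true_disp)]
    return sum(dists) / len(dists)


# Hypothetical displacements in pixels for the four corners.
true_d = [(4.0, -2.0), (-3.0, 1.0), (0.0, 5.0), (2.0, 2.0)]
pred_d = [(1.0, 2.0), (-3.0, 1.0), (0.0, 1.0), (2.0, 2.0)]

# Per-corner errors are 5, 0, 4, and 0 pixels, so the average is 2.25.
ace = average_corner_error(pred_d, true_d)

# Error of the all-zero starting estimate, for comparison.
ace_zero = average_corner_error([(0.0, 0.0)] * 4, true_d)
```

Comparing `ace` against `ace_zero` gives a quick sense of how much a prediction improves on the zero initialization used at the start of the iterative updates.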