SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

Runmin Zhang; Jun Ma; Si-Yuan Cao; Lun Luo; Beinan Yu; Shu-Jie Chen; Junwei Li; Hui-Liang Shen

TL;DR

SCPNet tackles unsupervised cross-modal homography estimation, for example between satellite and map images, under large offsets and modality gaps. It combines three components: intra-modal self-supervised learning, a correlation-based homography estimation network, and a consistent feature map projector. The framework comprises two self-supervised branches and a cross-modal supervision branch, so it learns without ground-truth homographies for the cross-modal pairs. It achieves state-of-the-art performance among unsupervised approaches on GoogleMap, Flash/no-flash, Harvard, RGB/NIR, and PDS-COCO, and reports lower MACE than the supervised approach MHN. In practice, SCPNet reduces reliance on ground-truth data for cross-modal and cross-spectral image registration.

Abstract

We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. The concept of intra-modal self-supervised learning is presented for the first time to facilitate unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map cross-modal dataset GoogleMap under [-32,+32] offset on a 128×128 image, with a mean average corner error (MACE) 14.0% lower than that of the supervised approach MHN. We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves state-of-the-art (SOTA) performance among unsupervised approaches, with MACEs 49.0%, 25.2%, 36.4%, and 10.7% lower than those of the supervised approach MHN. Source code is available at https://github.com/RM-Zhang/SCPNet.
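The abstract's two quantitative notions, a [-32,+32] corner offset on a 128×128 image and the mean average corner error (MACE), can be made concrete with a short sketch. This is an illustration, not the authors' code. It assumes the common four-point setup: each image corner is shifted by an independent uniform offset, the homography is recovered from the four correspondences by a direct linear transform, and the average corner error is the mean Euclidean distance between corners warped by the estimated and ground-truth homographies. All function names, the corner convention (0 to size-1), and the uniform sampling are assumptions made for this example.

```python
import numpy as np


def image_corners(size=128):
    """Four corners of a size x size image as (x, y) pixel coordinates (assumed 0..size-1)."""
    s = float(size - 1)
    return np.array([[0.0, 0.0], [s, 0.0], [s, s], [0.0, s]])


def homography_from_corners(src, dst):
    """Direct linear transform: find H such that dst ~ H @ src for four point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def average_corner_error(H_est, H_gt, size=128):
    """Mean Euclidean distance between corners warped by H_est and by H_gt."""
    c = image_corners(size)
    diff = warp_points(H_est, c) - warp_points(H_gt, c)
    return float(np.linalg.norm(diff, axis=1).mean())


def mace(H_ests, H_gts, size=128):
    """Average corner error averaged over a set of image pairs."""
    return float(np.mean([average_corner_error(a, b, size)
                          for a, b in zip(H_ests, H_gts)]))


def random_homography(rng, size=128, max_offset=32.0):
    """Sample a homography by shifting each corner uniformly in [-max_offset, +max_offset]."""
    c = image_corners(size)
    offsets = rng.uniform(-max_offset, max_offset, size=c.shape)
    return homography_from_corners(c, c + offsets), offsets


rng = np.random.default_rng(0)
H_gt, offsets = random_homography(rng)
# Error of predicting "no motion" (identity) against the sampled ground truth.
baseline = average_corner_error(np.eye(3), H_gt)
```

Under these assumptions, the identity baseline equals the mean length of the sampled corner offsets, which gives a sense of scale for MACE values on a benchmark with this offset range.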

Paper Structure (15 sections, 9 equations, 5 figures, 5 tables)

Introduction
Related Work
Pilot Experiments and Finding
SCPNet
Correlation-based Homography Estimation Network
Consistent Feature Map Projector
Training/Inference Framework
Loss Function and Implementation Details
Experiments
Datasets and Experimental Settings
Ablation
Evaluation on Cross-modal/spectral Datasets
Evaluation on PDS-COCO
Computational Burden
Conclusions

Figures (5)

Figure 1: Unsupervised homography estimation results of UDHN [nguyen2018unsupervised], CA-UDHN [zhang2020content], biHomE [koguciuk2021perceptual], and our SCPNet on the GoogleMap dataset under [-32,+32] offset. CL denotes the common cross-modal intensity-based learning, SL denotes the intra-modal self-supervised learning, C denotes correlation, and P denotes consistent feature map projection. CL (perceptual) means the cross-modal intensity-based learning is conducted with the perceptual loss. Green polygons denote the ground-truth homography deformation from $\mathbf{I}_\mathrm{B}$ (map) to $\mathbf{I}_\mathrm{A}$ (satellite). Red polygons denote the homography deformation estimated by the different algorithms, drawn on $\mathbf{I}_\mathrm{A}$ (satellite). Unlike previous works that adopt only cross-modal intensity-based learning, SCPNet introduces intra-modal self-supervised learning as extra supervision and uses an architecture based on correlation and consistent feature map projection, leading to successful unsupervised cross-modal homography learning under large offsets and modality gaps.
Figure 2: Cross-modal test MACEs over the training iterations for networks trained with intra-modal self-supervised learning and with cross-modal intensity-based learning, respectively.
Figure 3: Schematic diagram of unsupervised cross-modal homography estimation framework using intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. (a) Overall structure and training/inference strategy of SCPNet. (b) Detailed illustration of the correlation-based homography estimation network. (c) Detailed structure of the consistent feature map projector.
Figure 4: Comparison of the consistent feature maps produced by concatenation and correlation.
Figure 5: Qualitative homography estimation results on the GoogleMap, Flash/no-flash, Harvard, and RGB/NIR datasets. Green polygons denote the ground-truth homography deformation from $\mathbf{I}_\mathrm{B}$ (source, the deformed image) to $\mathbf{I}_\mathrm{A}$ (target). Red polygons denote the homography deformation estimated by the different algorithms, drawn on the target images.
