Towards Generalized Multimodal Homography Estimation

Jinkun You; Jiaxin Cheng; Jie Zhang; Yicong Zhou

TL;DR

We propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image, enabling the trained model to achieve greater robustness and improved generalization across domains.

Abstract

Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.
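As a rough illustration of what "unaligned image pairs with ground-truth offsets from a single input image" can mean, the sketch below perturbs the four image corners, solves for the induced homography, and warps the input. It assumes the common four-corner offset parameterization and covers only the geometric part; the texture and color rendering step described in the abstract is not reproduced here. All function names (`homography_from_points`, `warp_nearest`, `synthesize_pair`) and the `max_offset` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def homography_from_points(src, dst):
    """Solve the 3x3 homography H with dst ~ H @ src (direct linear transform)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def warp_nearest(image, h):
    """Warp `image` by H via inverse mapping with nearest-neighbour sampling."""
    height, width = image.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    target = target.reshape(-1, 3).T.astype(np.float64)
    # For every output pixel, look up where it came from in the input.
    source = np.linalg.inv(h) @ target
    sx = np.rint(source[0] / source[2]).astype(int)
    sy = np.rint(source[1] / source[2]).astype(int)
    valid = (sx >= 0) & (sx < width) & (sy >= 0) & (sy < height)
    flat = image.reshape(height * width, -1)
    out = np.zeros_like(flat)  # pixels that map outside the input stay zero
    out[valid] = flat[sy[valid] * width + sx[valid]]
    return out.reshape(image.shape)


def synthesize_pair(image, max_offset=8, rng=None):
    """Return (source, warped, corner_offsets, H) for one input image."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=np.float64,
    )
    # Assumed label format: per-corner (dx, dy) displacements.
    offsets = rng.uniform(-max_offset, max_offset, size=(4, 2))
    h = homography_from_points(corners, corners + offsets)
    return image, warp_nearest(image, h), offsets, h
```

A model trained on such pairs would regress `offsets` from the two images. In the setting the abstract describes, the two images would additionally be rendered with different textures and colors before training so that the pair resembles a cross-modal input.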

Paper Structure (12 sections, 18 equations, 5 figures, 6 tables)

Introduction
Related Work
Proposed Method
Overall Framework
Training Data Synthesis
Cross-Scale and Color-Invariant Network
Loss Function
Experiments
Experimental Settings
Training Data Synthesis
Cross-Scale and Color-Invariant Network
Conclusion

Figures (5)

Figure 1: Estimation results. The rows present the results of the supervised MCNet [zhu2024mcnet] and the unsupervised SSHNet [yu2025sshnet]. The first two columns represent within-dataset results, while the remaining columns depict cross-dataset performance. Greater similarity between the green and red quadrilaterals indicates higher accuracy.
Figure 2: Illustration of the training data synthesis and the homography estimation network. (a) Training Data Synthesis can be applied to a public RGB dataset for zero-shot homography estimation or integrated with existing datasets to enhance generalization. (b) Cross-Scale and Color-Invariant Network (CCNet) integrates cross-scale information into the extracted features while decoupling color information from the feature representations. (c) Color Decoupling. (d) Iterative Homography Estimation. (e) Multiscale Feature Extractor.
Figure 3: Examples of synthetic data. The first row presents the synthesis results with various content weights. The second row shows the results with different template images. The third row displays examples with different smoothing weights. From left to right, the content weight decreases in the first row and the smoothing weight increases in the third row.
Figure 4: Visualization results of within-dataset evaluation. The first row presents the estimation results from GoogleEarth, while the second row presents results from RGB-NIR. The first column displays the source image to be warped, while the other columns show the target images. Greater similarity between the red and green quadrilaterals indicates higher estimation accuracy.
Figure 5: Visualization results of zero-shot evaluation using testing images from GoogleMap. The first column displays the source image to be warped. A greater similarity between the red and green quadrilaterals indicates higher accuracy in estimation.
