Semantic-aware Representation Learning for Homography Estimation

Yuhan Liu, Qianxin Huang, Siqi Hui, Jingwen Fu, Sanping Zhou, Kangyi Wu, Pengna Li, Jinjun Wang

TL;DR

This work tackles homography estimation with detector-free dense matching by integrating fine-grained semantics from vision foundation models. SRMatcher combines a frozen semantic extractor (e.g., DINOv2) with a Semantic-aware Fusion Block (SFB) that fuses semantic and image features across images via a Semantic-guide Interactions Block (SGIB), improving both coarse and fine correspondences. Self-supervised training via geometric transformations avoids manual labeling, and a novel overlap-based fine matching stage refines matches to sub-pixel accuracy. Results on HPatches, ISC-HE, and MegaDepth show state-of-the-art performance, and SRMatcher can be plugged into other matching architectures, yielding substantial precision gains in homography estimation and related tasks.
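The idea of supervising with geometric transformations can be illustrated with a minimal sketch: if the second image is produced by warping the first with a known homography, ground-truth correspondences follow directly from that matrix and no manual labels are needed. The sampler `random_homography`, its parameters, and the stride-8 grid are hypothetical choices for illustration, not the authors' training pipeline.

```python
import random

def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists) to the point (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def random_homography(rng, max_shift=20.0, max_persp=1e-4):
    """Hypothetical sampler: identity plus a small translation and perspective terms."""
    return [
        [1.0, 0.0, rng.uniform(-max_shift, max_shift)],
        [0.0, 1.0, rng.uniform(-max_shift, max_shift)],
        [rng.uniform(-max_persp, max_persp), rng.uniform(-max_persp, max_persp), 1.0],
    ]

def ground_truth_matches(H, width, height, stride=8):
    """Warp the centers of a stride-spaced grid; keep points that land inside the image."""
    matches = []
    for y in range(stride // 2, height, stride):
        for x in range(stride // 2, width, stride):
            u, v = warp_point(H, x, y)
            if 0 <= u < width and 0 <= v < height:
                matches.append(((x, y), (u, v)))
    return matches

# The known H plays the role of the label: every kept grid point has an exact match.
rng = random.Random(0)
H = random_homography(rng)
matches = ground_truth_matches(H, width=64, height=64)
```

In a real pipeline the same matrix would also be used to warp the image pixels; the sketch only covers the correspondence labels.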

Abstract

Homography estimation is the task of determining the transformation between an image pair. Our approach employs detector-free feature matching to address this task. Previous work has underscored the importance of incorporating semantic information; however, an efficient way to utilize it is still lacking. Previous methods treat semantics as a pre-processing step, which makes their use of semantics overly coarse-grained and limits adaptability across tasks. In our work, we seek another way to use semantic information: a semantic-aware feature representation learning framework. Based on this, we propose SRMatcher, a new detector-free feature matching method that encourages the network to learn an integrated semantic feature representation. Specifically, to capture precise and rich semantics, we leverage recently popularized vision foundation models (VFMs) trained on extensive datasets. Then, a cross-image Semantic-aware Fusion Block (SFB) is proposed to integrate their fine-grained semantic features into the feature representation space. In this way, by reducing errors stemming from semantic inconsistencies in matching pairs, SRMatcher delivers more accurate and realistic outcomes. Extensive experiments show that SRMatcher surpasses solid baselines and attains state-of-the-art (SOTA) results on multiple real-world datasets. Compared to the previous SOTA approach GeoFormer, SRMatcher increases the area under the cumulative curve (AUC) by about 11% on HPatches. Additionally, SRMatcher can serve as a plug-and-play framework for other matching methods such as LoFTR, yielding substantial precision improvements.
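To make the AUC figure concrete, the sketch below shows one common way such a score is computed in homography benchmarks: measure the mean corner error between the estimated and ground-truth homographies for each image pair, then take the normalized area under the cumulative error curve up to a pixel threshold. The abstract does not specify the error measure or thresholds, so the corner-error definition and the function names here are assumptions.

```python
import bisect
import math

def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists) to the point (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def corner_error(H_est, H_gt, width, height):
    """Mean distance between the four image corners warped by H_est and by H_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        ex, ey = warp_point(H_est, x, y)
        gx, gy = warp_point(H_gt, x, y)
        total += math.hypot(ex - gx, ey - gy)
    return total / len(corners)

def error_auc(errors, threshold):
    """Area under the cumulative error curve on [0, threshold], normalized to [0, 1]."""
    errs = [0.0] + sorted(errors)
    recall = [i / len(errors) for i in range(len(errs))]
    k = bisect.bisect_left(errs, threshold)  # entries strictly below the threshold
    xs = errs[:k] + [threshold]
    ys = recall[:k] + [recall[k - 1]]
    # Trapezoidal integration of recall over error.
    area = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2 for i in range(len(xs) - 1))
    return area / threshold

# Hypothetical usage: one pair whose estimate is off by a (3, 4) pixel translation.
H_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_est = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
err = corner_error(H_est, H_gt, width=64, height=64)
score = error_auc([err], threshold=10.0)
```

A score of 1 means every pair has zero error, and a score of 0 means no pair falls below the threshold.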

Paper Structure (34 sections, 7 equations, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Homography transformation results from our proposed SRMatcher and MESA zhang2024mesa. The blue box outline was added manually to highlight the range of the homography transformation. The yellow boxes show the real scenes cropped from the target image. (c) and (d) are generated by superimposing the warped source image on the target image, showing that SRMatcher produces more accurate and realistic outcomes.
  • Figure 2: Overview of SRMatcher for detector-free local feature matching followed by homography estimation. With a pretrained semantic extraction network, SRMatcher uses fine-grained features to improve the matching results. The SFB enables interactions between the image features $\hat{C_{0}}$ and $\hat{C_{1}}$ and the semantic features $S_{0}$ and $S_{1}$, producing the fused features. The coarse matching block generates pixel-to-pixel matches $M_{c}$ at 1/8 scale. Subsequently, $M_{c}$ is fed into the overlap-based fine matching to yield fine matches $M_{f}$ at 1/2 scale.
  • Figure 3: Architecture of (a) the semantic-aware fusion block (SFB) and (b) the semantic-guide interactions block (SGIB). The SFB fuses image features and semantic features across images. Inside the SFB, the SGIB computes cross-attention with the image features as key K and value V and the semantic features as query Q.
  • Figure 4: Performance comparison. "-S" means using DINOv2 as the backbone to scale the parameters. "-Backbone" means using DINOv2 as the backbone to obtain coarse features. "-Concat" means concatenating the DINOv2 and CNN features. "-ViT" and "-VGG16" mean using different semantic extractors.
  • Figure 5: Qualitative comparison of matching results for LoFTR sun2021loftr, GeoFormer liu2023geometrized, MESA zhang2024mesa, and our SRMatcher. Points classified as inliers by RANSAC are displayed in green, while outliers are shown in red.
  • ...and 7 more figures
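
The cross-attention described in the Figure 3 caption, with semantic features as the query and image features as key and value, can be sketched as follows. This is a minimal single-head version in plain Python that omits the learned projections, normalization, and residual connections a real block would have; the function names are hypothetical, and the sketch does not say which image each input comes from.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(semantic, image):
    """Single-head attention: queries from semantic tokens, keys/values from image tokens.

    Assumption: Q, K, V projections are identity, so both inputs share one channel size.
    Returns the attended features (one row per semantic token) and the attention weights.
    """
    d = len(semantic[0])
    scores = matmul(semantic, transpose(image))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, image), weights

# Hypothetical toy input: 2 semantic tokens attend over 3 image tokens, 2 channels each.
semantic = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(semantic, image)
```

Each output row is a weighted average of image tokens, with weights determined by how strongly the corresponding semantic token responds to each image token.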