Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence

Felipe Cadar, Guilherme Potje, Renato Martins, Cédric Demonceaux, Erickson R. Nascimento

TL;DR

A new method that uses semantic cues from foundation vision model features (like DINOv2) to enhance local feature matching by incorporating semantic reasoning into existing descriptors, allowing feature caching and fast matching using similarity search, unlike learned matchers.

Abstract

Visual correspondence is a crucial step in key computer vision tasks, including camera localization, image registration, and structure from motion. The most effective techniques for matching keypoints currently rely on learned sparse or dense matchers, which need pairs of images as input. These neural networks have a good general understanding of features from both images, but they often struggle to match points from different semantic areas. This paper presents a new method that uses semantic cues from foundation vision model features (like DINOv2) to enhance local feature matching by incorporating semantic reasoning into existing descriptors. As a result, the learned descriptors do not require image pairs at inference time, which allows feature caching and fast matching using similarity search, unlike learned matchers. We present adapted versions of six existing descriptors, with an average performance increase of 29% in camera localization and accuracy comparable to existing matchers such as LightGlue and LoFTR on two existing benchmarks. Both code and trained models are available at https://www.verlab.dcc.ufmg.br/descriptors/reasoning_accv24

Paper Structure

This paper contains 30 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Leveraging semantic information for improving visual correspondence. The figure illustrates the matching process using Mutual Nearest Neighbor (MNN) for the base descriptor XFeat cit:xfeat and for our approach, which employs semantic conditioning (shown in the top right). Correct matches are shown in green and wrong matches in red. We can also assess the interpretability and consistency of the descriptors by finding the closest $128$ matches to a given query point in the image (red point in the bottom left) using either semantic or texture features. Hotter colors indicate higher similarities. Please notice the similarity ranking improvement with the conditioned features around the sink region.
  • Figure 2: Semantic Conditioning Pipeline. Our method first extracts both low-level scene texture via a base local feature descriptor (XFeat) and semantically meaningful, high-level features via a foundation vision model (DINOv2) at the associated salient texture keypoints. Then, the Reasoning module is applied to both representations, where cross-attention layers are used iteratively to enhance the texture and semantic features. Finally, the descriptor similarity is computed by combining the texture and semantic similarities using an element-wise product ($\odot$).
  • Figure 3: Visual matching results for three refined base descriptors in the two benchmarks. Green matches are inliers and red matches are outliers. The left side of the figure shows the results for the original descriptors SuperPoint cit:superpoint, XFeat cit:xfeat and ALIKED cit:aliked when matching two different image pairs. The right side of the figure shows the matching results for their semantically conditioned versions, obtained with our proposed methodology. Visual inspection shows that the matches are more consistent between the views. Please also notice the increased inlier ratio for the semantically conditioned versions.
  • Figure 4: Interpretability and consistency of the conditioned features. We show the closest $128$ matches to a given query keypoint (red point in the first column) for the different descriptors, using either semantics alone, the refined texture descriptor, or the proposed semantically conditioned features (fourth column). Hotter colors indicate higher similarity. Please notice the similarity ranking improvement with the conditioned features for finding matches such as the mouse cable (first row). The consistency of our approach is highlighted by the estimated closest keypoints when selecting the drawer handle in the kitchen (fourth row), which is occluded in the paired view.
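
The captions of Figures 1, 2 and 4 describe three operations: combining texture and semantic similarity with an element-wise product, matching with Mutual Nearest Neighbor (MNN), and ranking the closest 128 keypoints to a query. The sketch below shows one way these could fit together in NumPy. It is a minimal illustration and not the authors' implementation: the function names, the use of cosine similarity, and the clipping of negative similarities are all assumptions made here, and the Reasoning module that refines the descriptors is not modeled.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def combined_similarity(tex_a, sem_a, tex_b, sem_b):
    """Element-wise product of texture and semantic similarity matrices.

    tex_*: (N, Dt) texture descriptors; sem_*: (N, Ds) semantic descriptors.
    Cosine similarity and the clipping below are assumptions of this sketch.
    """
    sim_tex = l2_normalize(tex_a) @ l2_normalize(tex_b).T
    sim_sem = l2_normalize(sem_a) @ l2_normalize(sem_b).T
    # Clip negatives so that two negative scores cannot multiply into a
    # spuriously high positive similarity (assumption, not from the paper).
    return np.clip(sim_tex, 0.0, None) * np.clip(sim_sem, 0.0, None)


def mutual_nearest_neighbors(sim):
    """Return (i, j) index pairs where i and j are each other's best match."""
    nn_ab = sim.argmax(axis=1)  # best match in B for every keypoint in A
    nn_ba = sim.argmax(axis=0)  # best match in A for every keypoint in B
    idx_a = np.arange(sim.shape[0])
    mutual = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)


def top_k_matches(sim, query_idx, k=128):
    """Indices of the k keypoints in B most similar to one query in A."""
    k = min(k, sim.shape[1])
    return np.argsort(-sim[query_idx])[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for descriptors that would be cached per image.
    tex_a, sem_a = rng.normal(size=(500, 64)), rng.normal(size=(500, 256))
    tex_b, sem_b = rng.normal(size=(400, 64)), rng.normal(size=(400, 256))
    sim = combined_similarity(tex_a, sem_a, tex_b, sem_b)
    matches = mutual_nearest_neighbors(sim)
    print(sim.shape, matches.shape, top_k_matches(sim, 0).shape)
```

Because each image's descriptors are computed independently, they can be stored once and reused for any pairing; only the matrix products and the nearest-neighbor search depend on the image pair, which is the property the abstract contrasts with learned matchers.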