Revisit Anything: Visual Place Recognition via Image Segment Retrieval

Kartik Garg, Sai Shubodh Puligilla, Shishir Kolathaya, Madhava Krishna, Sourav Garg

TL;DR

The segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders.

Abstract

Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of the whole images. We propose to use open-set image segmentation to decompose an image into "meaningful" entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole-image-based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to "revisit anything" by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything.
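
As a rough illustration of the two ideas in the abstract (growing a segment into a SuperSegment through its neighbors, then aggregating pixel descriptors per SuperSegment), the following NumPy sketch may help. It is not the authors' implementation: the function names, the hop-based expansion over a segment adjacency matrix, the hard-assignment VLAD, and the reuse of per-segment residual sums as one possible reading of "factorized" aggregation are all assumptions made for illustration.

```python
import numpy as np


def expand_neighborhood(adjacency: np.ndarray, order: int) -> np.ndarray:
    """Row i marks the segments reachable from segment i within `order` hops.

    `adjacency` is an (N, N) symmetric 0/1 matrix of touching segments
    (hypothetical input; how adjacency is derived is not specified here).
    """
    n = adjacency.shape[0]
    reach = np.eye(n, dtype=bool)            # order 0: each segment on its own
    hop = adjacency.astype(bool) | reach     # one hop, keeping the segment itself
    for _ in range(order):
        reach = (reach.astype(int) @ hop.astype(int)) > 0
    return reach


def vlad_per_supersegment(desc, seg_ids, centers, membership):
    """Aggregate pixel descriptors into one unit-norm vector per SuperSegment.

    desc:       (P, D) pixel-level descriptors
    seg_ids:    (P,)   index of the segment each pixel belongs to
    centers:    (K, D) cluster centers (assumed to be precomputed)
    membership: (S, N) boolean, SuperSegment s contains segment n
    """
    n_seg = membership.shape[1]
    k, d = centers.shape

    # Hard-assign every pixel to its nearest cluster center.
    dists = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(1)
    residuals = desc - centers[assign]

    # Sum residuals once per (segment, cluster) pair.
    per_seg = np.zeros((n_seg, k, d))
    np.add.at(per_seg, (seg_ids, assign), residuals)

    # Each SuperSegment reuses the per-segment sums of its members.
    per_ss = np.einsum("sn,nkd->skd", membership.astype(float), per_seg)
    flat = per_ss.reshape(len(membership), k * d)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.maximum(norms, 1e-12)


if __name__ == "__main__":
    # Toy example: four segments in a chain, 0-1-2-3.
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
    rng = np.random.default_rng(0)
    desc = rng.normal(size=(50, 8))
    seg_ids = rng.integers(0, 4, size=50)
    centers = rng.normal(size=(3, 8))
    supersegments = expand_neighborhood(adj, order=1)
    vectors = vlad_per_supersegment(desc, seg_ids, centers, supersegments)
    print(vectors.shape)
```

Because the membership rows overlap, neighboring SuperSegments share segments; computing the residual sums once per segment and combining them afterwards avoids revisiting the same pixels for every SuperSegment that contains them.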

Paper Structure (47 sections, 6 equations, 8 figures, 12 tables)

This paper contains 47 sections, 6 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of our segment-retrieval-based VPR pipeline, dubbed SegVLAD: We use an open-set segmenter (SAM) to extract segment masks, which are converted into SuperSegments using the neighbouring image segments. Using pixel-level DINOv2 descriptors with VLAD-based aggregation, we obtain SuperSegment descriptors, which are matched against a flat index of SuperSegment descriptors obtained from all the images of the entire reference database.
  • Figure 2: Neighborhood expansion (Eq. \ref{eq:ss}) of a window in the leftmost image to the whole building in the rightmost image, progressing from no neighborhood aggregation to a third-order aggregation. This neighborhood expansion is in stark contrast with a typical regular grid- or patch-based approach, which may not capture semantically meaningful SuperSegments.
  • Figure 3: Illustration of four SuperSegments obtained from the same image. All four of these spatially overlap with each other, which is different from coarse segmentation methods that do not typically allow overlap across segments.
  • Figure 4: Qualitative results: The columns show the query image, the correct match of SegVLAD, and the incorrect match of AnyLoc, respectively. The rows present examples from different datasets: AmsterTime, Baidu Mall, and Pitts-30K.
  • Figure 5: Recall vs storage/retrieval time on AmsterTime.
  • ...and 3 more figures