Context-Based Visual-Language Place Recognition

Soojin Woo, Seong-Woo Kim

TL;DR

This paper introduces a novel VPR approach that remains robust to scene changes, requires no additional training, and outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors.

Abstract

In vision-based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision-based place recognition relies on low-level visual features. Despite significant progress in recent years, place recognition based on low-level visual features remains challenging when the scene appearance changes. To address this, end-to-end training approaches have been proposed to overcome the limitations of hand-crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data to train models, which is a significant limitation. Methods that leverage high-level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and does not require additional training. Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using a real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The language-driven semantic segmentation is based on a pre-defined label set. The segmentation results on the KITTI dataset were obtained by correlating per-pixel embeddings and text embeddings. The pre-defined labels include road, sidewalk, building, vehicle, car, bicycle, motorcycle, vegetation, trunk, terrain, cyclist, pole, sky, and other.
  • Figure 2: System Overview. The pre-trained text encoder of LSeg is used to generate text embeddings from a pre-defined label set. The visual encoder of LSeg extracts per-pixel embeddings from the image, which are then correlated with text embeddings to predict the class label for each pixel. The predicted labels are used to filter out pixel coordinates corresponding to dynamic objects, such as cars. After filtering, $K$ keypoints are randomly selected from the remaining pixel coordinates to create descriptors.
  • Figure 3: Context Graph. Each circle represents the centroid of a cluster and serves as a node in the context graph. Each panel visualizes the context graph of one image. These qualitative results show that visually similar images yield more similar context graphs.
  • Figure 4: Feature Extraction. (a) With ORB features, keypoints are extracted even from pixels corresponding to dynamic objects such as cars. In addition, very few features are extracted from the right side of the image, so feature coverage is uneven across the image. (b) In our approach, filtering is applied so that features are not extracted from cars, and features are evenly distributed across the entire image. Our method also uses fewer features than ORB, which is an advantage in terms of computational efficiency.
  • Figure 5: Correspondence Matching. The results of correspondence matching are visualized as follows: (a) matching results based on ORB features and (b) matching results based on our method.
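
The pipeline described in the Figure 2 caption (correlate per-pixel embeddings with text embeddings, predict a label per pixel, drop pixels on dynamic objects, then randomly pick $K$ keypoints) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `label_pixels` and `sample_static_keypoints` are hypothetical, cosine similarity is assumed as the correlation measure, and the embeddings are taken as plain NumPy arrays rather than actual LSeg outputs. Which labels count as dynamic is also an assumption here; the caption only names cars as an example.

```python
import numpy as np


def label_pixels(pixel_emb, text_emb, eps=1e-12):
    """Assign each pixel the label whose text embedding it matches best.

    pixel_emb: (H, W, D) per-pixel embeddings (assumed to come from a visual encoder).
    text_emb:  (C, D) one embedding per label in the pre-defined label set.
    Returns an (H, W) array of label indices.
    """
    # Assumption: correlation is cosine similarity, so normalise both sides.
    p = pixel_emb / (np.linalg.norm(pixel_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    scores = p @ t.T  # (H, W, C) similarity of every pixel to every label
    return scores.argmax(axis=-1)


def sample_static_keypoints(labels, dynamic_ids, k, rng):
    """Randomly pick up to k (row, col) coordinates that are not on dynamic objects."""
    static_mask = ~np.isin(labels, list(dynamic_ids))
    coords = np.argwhere(static_mask)  # (N, 2) coordinates of the remaining pixels
    k = min(k, len(coords))
    if k == 0:
        return coords[:0]
    idx = rng.choice(len(coords), size=k, replace=False)
    return coords[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label_names = ["road", "building", "car"]  # small subset of the label set
    text_emb = rng.normal(size=(len(label_names), 8))  # stand-in text embeddings
    pixel_emb = rng.normal(size=(4, 6, 8))  # stand-in per-pixel embeddings
    labels = label_pixels(pixel_emb, text_emb)
    dynamic = {label_names.index("car")}  # assumed dynamic class
    keypoints = sample_static_keypoints(labels, dynamic, k=5, rng=rng)
    print(keypoints.shape)
```

The sketch stops at keypoint selection. How descriptors are built from the selected keypoints is not specified in the captions, so it is left out.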