
Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

Zonglin Lyu, Juexiao Zhang, Mingxuan Lu, Yiming Li, Chen Feng

TL;DR

This work introduces multimodal LLMs to visual place recognition (VPR), where a robot must localize itself using visual observations. It uses vision-based retrieval to propose several candidates and then leverages language-based reasoning to carefully inspect each candidate for a final decision.

Abstract

Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe our work can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
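To make the two-stage design described above concrete, the following is a minimal sketch of the vision-based coarse retriever: an off-the-shelf DINOv2 backbone produces patch features, which are GeM-pooled into a global descriptor and matched against the database by cosine similarity. The hub model size, pooling exponent, and top-k value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Off-the-shelf DINOv2 backbone from torch.hub (ViT-B/14 is an assumed size choice).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

def gem_pool(tokens, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over patch tokens: (B, N, D) -> (B, D)."""
    return tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)

@torch.no_grad()
def global_descriptor(images):
    """L2-normalized GeM descriptor for a batch (B, 3, H, W); H and W must be multiples of 14."""
    tokens = model.forward_features(images)["x_norm_patchtokens"]
    return F.normalize(gem_pool(tokens), dim=-1)

@torch.no_grad()
def coarse_retrieve(query_img, db_imgs, top_k=10):
    """Indices of the top-k database images most similar to the query (cosine similarity)."""
    sims = global_descriptor(query_img) @ global_descriptor(db_imgs).T  # (1, M)
    return sims.topk(top_k, dim=-1).indices.squeeze(0)
```

Because the descriptor comes from a frozen, general-purpose backbone, no VPR-specific training is involved; the retrieved candidates are then handed to the language-based refiner for the final decision.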

Paper Structure

This paper contains 13 sections, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Vision and language meet place recognition. Alice gives Bob verbal descriptions of her surroundings (a). Bob compares his visual observations with Alice's descriptions (b)-(d) and reasons about their accuracy, confirming (d) as the correct place.
  • Figure 2: Comparison between traditional VPR methods and our LLM-VPR. We build a VPR solution based on off-the-shelf foundation models and do not need VPR-specific supervised training. Meanwhile, we leverage language-based reasoning to further refine the localization precision.
  • Figure 3: Overview of LLM-VPR. (a) Vision-based Coarse Retriever. (b) Vision-Language Refiner. We first coarsely retrieve the top-10 candidates via the [CLS] token or the GeM-aggregated descriptor of DINOv2 (oquab2023dinov2) features. We then construct ten query-candidate pairs and feed them one by one to the Vision-Language Refiner, which describes and reasons about each pair (a hedged sketch of this pairwise step follows the figure list).
  • Figure 4: Example of how our method works. The query is selected from Pittsburgh30K (torii2013visual); candidate 1 is the top-1 retrieval of DINOv2 + GeM, and candidate 2 is the top-1 retrieval of our method. The correct retrieval has a green border, and the incorrect one has a red border.
  • Figure 5: Examples of (a) success cases, (b) failure cases, and (c) 'cannot help' cases. Correct top-1 retrievals have a green border, and incorrect ones have a red border. The top two rows in (a) and the first row in (b) are from Tokyo247. The last row in (c) is from Pittsburgh30K. The other examples are from Baidu Mall. Text boxes are summarized GPT-4V outputs.
  • ...and 5 more figures
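To illustrate the Vision-Language Refiner in Figure 3(b), below is a hedged sketch of the pairwise describe-and-reason step. The prompt wording, the choice of GPT-4o through the OpenAI Chat Completions API, and the helper names are assumptions made for illustration; the paper's exact prompts and MLLM may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def _image_part(path):
    """Encode an image file as a data-URL message part (assumes JPEG input)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def describe_pair(query_path, candidate_path):
    """Ask the MLLM to describe the differences between the query and one candidate."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Image 1 is the current observation and image 2 is a candidate place. "
                         "Describe their visual differences and state whether they could be "
                         "the same place."},
                _image_part(query_path),
                _image_part(candidate_path),
            ],
        }],
    )
    return resp.choices[0].message.content

def rerank(query_path, candidate_paths):
    """Collect pairwise descriptions, then ask the LLM to pick the best candidate."""
    notes = [f"Candidate {i + 1}: {describe_pair(query_path, p)}"
             for i, p in enumerate(candidate_paths)]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Based on these pairwise comparisons, which candidate most likely "
                              "shows the same place as the query? Answer with its number.\n\n"
                              + "\n\n".join(notes)}],
    )
    return resp.choices[0].message.content
```

Comparing the query against each candidate in a pairwise manner keeps every MLLM call small and focused, and the final decision is made over the collected descriptions rather than over all candidate images at once.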