Signage-Aware Exploration in Open World using Venue Maps

Chang Chen, Liang Lu, Lei Yang, Yinqiang Zhang, Yizhou Chen, Ruixing Jia, Jia Pan

TL;DR

The paper tackles locating landmarks in unknown open-world environments by leveraging 2D venue maps and scene signage. It introduces a signage understanding pipeline based on diffusion-driven text instance retrieval and 2D-to-3D fusion to robustly recognize signage with arbitrary shapes, coupled with a venue-map-guided exploration-exploitation planner that balances exploration of unknown areas with exploitation to approach and orient toward signs. The key contributions are the topological planning on venue maps, the diffusion-based signage retrieval, and the integrated planning framework that yields higher signage coverage and faster search in large-scale malls, outperforming state-of-the-art text spotting and traditional exploration baselines. The approach demonstrates practical improvements in navigation efficiency and landmark localization, highlighting the value of grounding text-level cues in non-metric venue maps for robust real-world exploration.

Abstract

Current exploration methods struggle to search for shops or restaurants in unknown open-world environments due to the lack of prior knowledge. Humans can leverage venue maps that offer valuable scene priors to aid exploration planning by correlating the signage in the scene with landmark names on the map. However, arbitrary shapes and styles of the texts on signage, along with multi-view inconsistencies, pose significant challenges for robots to recognize them accurately. Additionally, discrepancies between real-world environments and venue maps hinder the integration of text-level information into the planners. This paper introduces a novel signage-aware exploration system to address these challenges, enabling the robots to utilize venue maps effectively. We propose a signage understanding method that accurately detects and recognizes the texts on signage using a diffusion-based text instance retrieval method combined with a 2D-to-3D semantic fusion strategy. Furthermore, we design a venue map-guided exploration-exploitation planner that balances exploration in unknown regions using directional heuristics derived from venue maps and exploitation to get close and adjust orientation for better recognition. Experiments in large-scale shopping malls demonstrate our method's superior signage recognition performance and search efficiency, surpassing state-of-the-art text spotting methods and traditional exploration approaches. Project website: https://sites.google.com/view/signage-aware-exploration.
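The retrieval step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the feature vectors are made-up placeholders for what a scene text spotter would produce, multi-view fusion is reduced to a plain average, and the names `cosine`, `fuse_views`, `retrieve`, and the `min_sim` threshold are hypothetical. The idea shown is only that a detected sign is matched to the closest pre-rendered landmark-name feature rather than decoded character by character.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def fuse_views(view_feats: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-view features of one sign (assumed stand-in for fusion)."""
    n = len(view_feats)
    return [sum(col) / n for col in zip(*view_feats)]


def retrieve(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    min_sim: float = 0.5,  # hypothetical rejection threshold
) -> Optional[Tuple[str, float]]:
    """Return (landmark name, similarity) of the best match above min_sim."""
    best: Optional[Tuple[str, float]] = None
    for name, feat in gallery.items():
        sim = cosine(query, feat)
        if sim >= min_sim and (best is None or sim > best[1]):
            best = (name, sim)
    return best


# Offline gallery: one placeholder feature per landmark name on the venue map.
gallery = {
    "Briketenia": [0.9, 0.1, 0.0],
    "Example Cafe": [0.0, 0.2, 0.9],  # hypothetical landmark
}

# Online: two views of the same detected sign, fused before retrieval.
fused = fuse_views([[0.8, 0.2, 0.1], [1.0, 0.0, 0.1]])
match = retrieve(fused, gallery)
print(match[0] if match else "no match")
```

Matching against a closed set of landmark names means a noisy observation can still resolve to the correct name, and the threshold lets text that is not on the venue map be rejected instead of forced into a match.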

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: We propose to leverage the textual information in a venue map to facilitate shop searching in unknown open-world environments. The robot localizes itself in the environment by recognizing and matching the texts on a sign to the venue map. Then the robot plans a direction to the next landmark 'Briketenia'.
  • Figure 2: Overall framework. Our method first constructs a topological graph on a given venue map (Sec. \ref{planning}). Then, given the RGB-D image, the proposed signage understanding method recognizes the texts on the signage and correlates them with the text set of the venue map (Sec. \ref{sec:mapping}). Once localized on the venue map, the next landmark goal is inferred to guide the selection of frontiers. Our system balances exploration and exploitation to improve both signage coverage rates and search efficiency during the process (Sec. \ref{sec:exploration}).
  • Figure 3: The pipeline of signage understanding. (a) In the offline stage, we use a text-diffusion model to render the landmark names onto generated images. (b) In the online stage, we project the detected text images to 3D space and fuse their features with those of multi-view images. Then we retrieve the offline images most similar to the detected images as the recognition results, which are projected onto a signage map. All text features are extracted by a scene text spotter $\phi$.
  • Figure 4: Position estimation of the next landmark by calculating the coordinate transformation between the online map and the venue map for guiding the frontier selection.
  • Figure 5: Examples of signage recognition. The first column shows the ground truths of the texts. In the remaining columns, red boxes highlight the texts of interest, and the words in white boxes are the corresponding recognition results from ESTextSpotter \cite{huang2023estextspotter}. Noisy recognition results and interference from text instances that are not present on the signage make accurate signage recognition difficult. Our method can robustly match the real images with generated images even when their recognized texts differ from the ground truths.
  • ...and 3 more figures
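
The coordinate transformation mentioned in the Figure 4 caption can be sketched as follows. The sketch assumes the relation between venue-map coordinates and the online map is well approximated by a 2D similarity transform (scale, rotation, translation) with the same handedness in both frames; the paper may use a different model, and a non-metric venue map will only fit approximately. The function names and all coordinates are hypothetical. Points are treated as complex numbers, so the least-squares fit of $w = a z + b$ has a closed form.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def fit_similarity(
    venue_pts: Sequence[Point], online_pts: Sequence[Point]
) -> Tuple[complex, complex]:
    """Least-squares fit of w = a*z + b from venue points z to online points w.

    abs(a) is the scale and the complex argument of a is the rotation angle.
    """
    if len(venue_pts) != len(online_pts) or len(venue_pts) < 2:
        raise ValueError("need at least two matched landmark pairs")
    z = [complex(x, y) for x, y in venue_pts]
    w = [complex(x, y) for x, y in online_pts]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    denom = sum(abs(zi - z_mean) ** 2 for zi in z)
    if denom == 0.0:
        raise ValueError("venue points must not all coincide")
    a = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w)) / denom
    b = w_mean - a * z_mean
    return a, b


def venue_to_online(p: Point, a: complex, b: complex) -> Point:
    """Map a venue-map point into the online map frame."""
    q = a * complex(p[0], p[1]) + b
    return (q.real, q.imag)


# Hypothetical matches: venue-map pixels of already recognized signs and
# their estimated positions (in meters) in the robot's online map.
venue = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
online = [(1.0, 2.0), (1.0, 7.0), (-4.0, 2.0)]

a, b = fit_similarity(venue, online)
# Predicted online-map position of the next landmark, given its venue-map pixel.
next_goal = venue_to_online((100.0, 100.0), a, b)
print(next_goal)
```

The predicted position is best read as a directional hint for ranking frontiers rather than a precise goal, since errors in the venue map carry over directly into the estimate.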