SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments

Jian Sun, Yuming Huang, He Li, Shuqi Xiao, Shenyan Guo, Maani Ghaffari, Qingbiao Li, Chengzhong Xu, Hui Kong

Abstract

Humans routinely leverage semantic hints provided by signage to navigate to destinations within novel Large-Scale Indoor (LSI) environments, such as hospitals and airport terminals. However, this capability remains underexplored within the field of embodied navigation. This paper introduces a novel embodied navigation task, SignNav, which requires the agent to interpret semantic hints from signage and reason about the subsequent action based on its current observation. To facilitate research in this domain, we construct the LSI-Dataset for the training and evaluation of various SignNav agents. Dynamically changing semantic hints and the sparse placement of signage in LSI environments present significant challenges to the SignNav task. To address these challenges, we propose the Spatial-Temporal Aware Transformer (START) model for end-to-end decision-making. The spatial-aware module grounds the semantic hints of signage in the physical world, while the temporal-aware module captures long-range dependencies between historical states and the current observation. Leveraging a two-stage training strategy with Dataset Aggregation (DAgger), our approach achieves state-of-the-art performance, recording an 80% Success Rate (SR) and 0.74 NDTW on the val-unseen split. Real-world deployment further demonstrates the practicality of our method in physical environments without a pre-built map.
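For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SR and NDTW are commonly computed in embodied navigation benchmarks. It is not the authors' evaluation code. The function names, the 2-D waypoint representation, and the 3.0 m distance threshold are assumptions chosen for illustration; the abstract does not state the threshold used for SignNav.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths given as (x, y) waypoints."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated distance aligning the first i
    # reference points with the first j query points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the query follows the reference exactly.

    threshold is an assumed success distance in meters, not a value from the paper.
    """
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


def success(query, goal, threshold=3.0):
    """An episode counts as a success if the agent stops within threshold of the goal."""
    return math.dist(query[-1], goal) <= threshold


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    agent_path = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    print("NDTW:", round(ndtw(reference, agent_path), 3))
    print("Success:", success(agent_path, goal=reference[-1]))
```

SR is then the fraction of episodes for which `success` holds, and the reported NDTW is the mean of `ndtw` over episodes. NDTW rewards following the reference path closely, whereas SR only checks the final stopping position.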

Paper Structure (18 sections, 11 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Capital Region International Airport in Clinton County, Michigan. When humans first arrive at an airport terminal, they routinely leverage the semantic hints of signage to find their boarding gates. We formulate this problem as an embodied navigation task, SignNav, and build the LSI-Dataset to support research on it.
  • Figure 2: An episode example in a hospital scene. The agent is required to make action decisions according to the semantic hints of signage (detected directional arrows) and finally arrive at the target location.
  • Figure 3: The architecture of our Spatial-Temporal Aware Transformer (START) model for the SignNav task. START uses a spatial-aware module for grounding the semantic hints of signage in the egocentric observation, and a temporal-aware module for capturing long-range dependencies between the historical states and the current observation.
  • Figure 4: Different action decisions under different circumstances. (a) different semantic hints at the same location; (b) the same semantic hints at different locations; (c) different semantic hints at different locations.
  • Figure 5: A period within an episode during which no semantic hints from signage are visible. The agent must still make correct decisions during this period.
  • ...and 3 more figures