SignScene: Visual Sign Grounding for Mapless Navigation

Nicky Zimmerman, Joel Loo, Benjamin Koh, Zishuo Wang, David Hsu

TL;DR

SignScene is a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information and presents them to VLMs in a form conducive to effective reasoning. It enables real-world mapless navigation on a Spot robot using only signs.

Abstract

Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.

Paper Structure (19 sections, 14 figures, 2 tables)


Figures (14)

  • Figure 0: Parsing. We use in-context learning to improve sign understanding performance. The prompt includes a sign dictionary with examples of commonly encountered signs and their labels. The output is a list of pairs, where each pair consists of a location and its associated navigational instruction.
  • Figure 1: AToM. The 3D map represents signs and explicit navigational structures as abstract objects, and implicit paths as dense point clouds, and can be rendered into top-down, sign-centric views for VLM reasoning.
  • Figure 2: SignScene. In a given scene, SignScene constructs AToM from RGB observations and robot poses. When deployed on a real robot, it offers modules to align the robot head-on to signs, parse their content, and select signs containing goal-relevant information. The resulting AToM supports queries about directions and actions to take in the local scene.
  • Figure 3: Failure Analysis. Panels (a) and (b) illustrate reasoning failures: despite accurate map construction and correct parsing, the VLM reasoning step can still fail. The red dot indicates the location of the sign. In (a), the VLM fails to correctly ground the compound instruction "left-then-forward". In (b), the VLM's choice of boundary point C is influenced by "exit c" in the parsed location. Panel (c) illustrates parsing in context: all arrows are diagonal, but the grounded navigational instructions differ. The green-annotated arrows refer to going forward-left and forward-right towards frontiers A and B, respectively. However, the red-annotated arrow refers to the down-moving escalator to the right, rather than walking backwards-right.
  • Figure 4: Real-world mapless navigation with signs. The robot is given "TERRACE" as a goal: (a) it explores multiple signs until it finds one with information relevant to the goal, then (b) explores the local environment to build AToM, enabling successful grounding and guiding the robot to take the stairs to reach "TERRACE" in (c).
  • ...and 9 more figures
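
The parsing output described in Figure 0 (a list of location and instruction pairs) and the goal-relevant selection mentioned in Figure 2 can be illustrated with a small data structure. This is only a minimal sketch, not the authors' implementation: the names `SignEntry`, `build_parsing_prompt`, and `select_relevant` are hypothetical, the example sign contents are made up for illustration, and the simple substring match stands in for whatever VLM-based selection the system actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignEntry:
    """One parsed sign entry: a location and its navigational instruction."""
    location: str
    instruction: str


def build_parsing_prompt(sign_dictionary: dict[str, str]) -> str:
    """Assemble an in-context prompt from example signs and their labels.

    The prompt wording is an assumption; the source only states that a
    sign dictionary with examples and labels is included in the prompt.
    """
    lines = ["Example signs and their labels:"]
    for symbol, label in sign_dictionary.items():
        lines.append(f"- {symbol}: {label}")
    lines.append("List every (location, instruction) pair shown on the sign.")
    return "\n".join(lines)


def select_relevant(entries: list[SignEntry], goal: str) -> list[SignEntry]:
    """Keep entries whose location mentions the goal (case-insensitive).

    A substring match is a stand-in for the real relevance check.
    """
    goal_key = goal.strip().lower()
    return [e for e in entries if goal_key in e.location.lower()]


# Hypothetical parsed output for a single sign.
parsed = [
    SignEntry("Terrace", "take the stairs up"),
    SignEntry("Exit C", "turn left, then go forward"),
]
prompt = build_parsing_prompt({"arrow pointing up": "go forward"})
matches = select_relevant(parsed, "TERRACE")
```

Keeping each entry as an explicit pair means a later grounding step can treat the instruction text separately from the location name, which matters in cases like Figure 3(b), where the location string itself ("exit c") biased the VLM's choice.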