Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis

Vishnu Sashank Dorbala; Sanjoy Chowdhury; Dinesh Manocha

TL;DR

The paper tackles the generation of human-like, platform-agnostic wayfinding instructions for embodied navigation without training on platform-specific datasets. It presents a two-stage method: spatial knowledge is first extracted from egocentric frames using an LLM together with BLIP-based VQA, and in-context learning with reference instruction styles then produces diverse, human-like instructions. The approach is implemented on Matterport3D, AI Habitat, and ThreeDWorld. In a user study, 83.3% of users found that the synthesized instructions accurately capture the details of the environment and resemble human-generated ones, and zero-shot VLN on REVERIE with the generated instructions closely matches the baseline (less than 1% change in SR), supporting cross-platform viability. The work enables scalable evaluation of embodied navigation policies and provides a data-efficient alternative to platform-specific human annotations, while discussing limitations, directions for broader generalizability, and ethical considerations.

Abstract

We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. We finally discuss the applicability of our approach in enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without training.
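The two stages described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the callables `ask_llm` and `answer_vqa`, the prompt wording, and the number of question rounds are all assumptions introduced here, standing in for a chat LLM and a BLIP-style VQA model.

```python
from typing import Callable, List

# Hypothetical stand-ins: wrap any chat LLM and any VQA model to match these.
AskLLM = Callable[[str], str]          # prompt -> generated text
AnswerVQA = Callable[[str, str], str]  # (image id, question) -> answer


def describe_frame(image_id: str, ask_llm: AskLLM, answer_vqa: AnswerVQA,
                   rounds: int = 3) -> str:
    """Stage 1 (assumed form): the LLM asks questions, the VQA model answers."""
    qa_pairs: List[str] = []
    for _ in range(rounds):
        history = "\n".join(qa_pairs) or "(none yet)"
        question = ask_llm(
            "Ask one new question about an indoor scene.\nSo far:\n" + history
        )
        answer = answer_vqa(image_id, question)
        qa_pairs.append(f"Q: {question} A: {answer}")
    # The LLM condenses the question-answer history into a single caption.
    return ask_llm("Summarize these answers as one caption:\n" + "\n".join(qa_pairs))


def build_instruction_prompt(captions: List[str], references: List[str]) -> str:
    """Stage 2 (assumed form): in-context prompt from references and captions."""
    ref_block = "\n".join(f"Example {i + 1}: {r}" for i, r in enumerate(references))
    cap_block = "\n".join(f"Step {i + 1}: {c}" for i, c in enumerate(captions))
    return (
        "Reference instructions (match this style):\n" + ref_block + "\n\n"
        "Observations along the path, in order:\n" + cap_block + "\n\n"
        "Write one wayfinding instruction for this path in the reference style."
    )


def synthesize_instruction(frames: List[str], references: List[str],
                           ask_llm: AskLLM, answer_vqa: AnswerVQA) -> str:
    """Caption every frame along the path, then request one instruction."""
    captions = [describe_frame(f, ask_llm, answer_vqa) for f in frames]
    return ask_llm(build_instruction_prompt(captions, references))
```

Because no simulator-specific data enters the prompt, switching platforms in this sketch only changes where the frames come from, and switching instruction style only changes the reference texts.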

Paper Structure (27 sections, 8 figures, 1 table)

Introduction
Approach
Extracting Spatial Knowledge: LLM + BLIP
Synthesizing Wayfinding Instructions via In-Context Learning
Evaluation & Results
Qualitative: User Study
Quantitative: Embodied Navigation
Discussion: Evaluating Generalizability of Embodied Navigation Policies
Conclusion
Limitations and Future Work
Ethics Statement
Acknowledgments
In-Context Learning Strategies
Influence of LLM + BLIP
Empirical Information on Instruction Styles
...and 12 more sections

Figures (8)

Figure 1: Overview: We use in-context learning with an LLM to generate multiple styles of wayfinding instructions for embodied navigation. Given any environment, we first gather a set of egocentric images along a path (white arrows), and obtain spatial knowledge via Visual Question Answering. We then condition an LLM on different styles of instructional language (coarse as well as fine-grained) via reference texts. The figure highlights wayfinding instructions for this environment generated without training on any datasets.
Figure 2: Extracting Spatial Knowledge: We use GPT-3.5-turbo along with BLIP to maximize the knowledge captured from an image, similar to ChatCaptioner. We notice that adding more detail to the captions helps improve the quality of the final instruction by filtering out unnecessary information. More details are in the appendix.
Figure 3: Given any embodied simulator, we synthesize multiple styles of wayfinding instructions for agents. Spatial knowledge is first mined from the captured egocentric images $\mathcal{I}$ using the LLM and BLIP. The resulting captions are fed into a prompt along with a few reference examples representing the desired instruction style. Finally, the LLM is conditioned with this prompt to generate a human-like instruction in the style of the reference text, using the captioned information.
Figure 4: Egocentric Image Sequence from a path in ThreeDWorld (TDW)
Figure 5: Egocentric Image Sequence from a path in AI Habitat
...and 3 more figures
