LangNav: Language as a Perceptual Representation for Navigation

Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim

TL;DR

This work addresses vision-and-language navigation (VLN) in low-data regimes by proposing LangNav, which uses language as the perceptual representation of a scene. Visual observations are converted into text by off-the-shelf image-captioning and object-detection systems, and a pretrained language model is finetuned to predict navigation actions from the textual descriptions, the instruction, and the trajectory history. The paper presents three case studies: (i) generating synthetic data with GPT-4 to train a smaller LM, (ii) language-based domain transfer from ALFRED to R2R, and (iii) augmenting vision features with language features to improve VLN performance. LangNav improves on vision-based baselines when only a few expert trajectories (10-100) are available and shows some transfer benefits, although traditional vision-based methods still perform better in the full-data setting. LangNav additionally offers interpretability and transfer potential, especially when combined with vision features.

Abstract

We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore several use cases of our language-based navigation (LangNav) approach on the R2R VLN benchmark: generating synthetic trajectories from a prompted language model (GPT-4) with which to finetune a smaller language model; domain transfer where we transfer a policy learned on one simulated environment (ALFRED) to another (more realistic) environment (R2R); and combining both vision- and language-based representations for VLN. Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories (10-100) are available, demonstrating the potential of language as a perceptual representation for navigation.
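
To make the idea of language as a perceptual representation concrete, here is a minimal sketch of how per-view captions and detected objects might be assembled into a text prompt for the language model. The prompt layout, the `ViewDescription` fields, and the function names are illustrative assumptions and are not the template used in the paper; the captions and object labels are assumed to come from whatever off-the-shelf vision systems are available.

```python
from dataclasses import dataclass, field


@dataclass
class ViewDescription:
    """Text describing one candidate direction (hypothetical structure)."""
    heading: str                                        # e.g. "front", "left", "rear"
    caption: str                                        # output of an image-captioning model
    objects: list[str] = field(default_factory=list)    # labels from an object detector


def describe_view(index: int, view: ViewDescription) -> str:
    """Render one candidate view as a single line of text."""
    objects = ", ".join(view.objects) if view.objects else "none detected"
    return f"Candidate {index} ({view.heading}): {view.caption} Objects: {objects}."


def build_prompt(instruction: str, history: list[str],
                 views: list[ViewDescription]) -> str:
    """Combine instruction, trajectory history, and current views into one prompt.

    The layout below is an assumption made for illustration only.
    """
    lines = [f"Instruction: {instruction}", "History:"]
    # Fall back to a placeholder line when no steps have been taken yet.
    lines += [f"  Step {t}: {h}" for t, h in enumerate(history, start=1)] or ["  (none)"]
    lines.append("Current view:")
    lines += [f"  {describe_view(i, v)}" for i, v in enumerate(views)]
    lines.append("Which candidate should the agent move towards?")
    return "\n".join(lines)
```

A finetuned language model would then be asked to complete this prompt with the index of the chosen candidate. Because the whole observation is plain text, it can be read and edited by hand.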

Paper Structure (43 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Overview of language-based navigation (LangNav). We describe the task instructions and visual observations (from off-the-shelf vision systems) through text. A language model is then finetuned to predict which direction to move towards based on the language descriptions. Here, views A, B, and C correspond to the front, left, and rear views of the agent.
  • Figure 2: Pipeline for generating synthetic navigation trajectories from GPT-4. We first prompt GPT-4 with 3 randomly sampled navigation instructions $U$ to generate 10 more synthetic navigation instructions (Phase 1). Then, for each generated navigation instruction, we prompt GPT-4 to generate a trajectory that fulfills it (Phase 2). See the appendix for details.
  • Figure 3: An example of a generated trajectory from GPT-4. The trajectory was generated by following the pipeline in Figure 2. See the appendix for more examples.
  • Figure 4: Interpreting and editing a model's predictions through language. At the beginning, the agent incorrectly selected "candidate 2" to ascend the stairs. The failure might stem from the ambiguous description in "candidate 1", which mistakes the stairs for a hallway. After the description is edited (marked in green), the agent correctly changes its choice and walks down the stairs.
  • Figure 5: Task gap between ALFRED and R2R. We highlight notable distinctions between the navigation tasks in ALFRED and R2R, encompassing variations in appearance, step size, and instruction complexity. See the appendix for more details.
  • ...and 1 more figure
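
The two-phase generation pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: `complete` is a hypothetical callable that wraps whichever LLM API is in use, and the prompt wording and the line-based parsing of the reply are assumptions. Only the counts (3 sampled seed instructions, 10 generated instructions) come from the caption.

```python
import random
from typing import Callable, Optional


def generate_synthetic_data(
    seed_instructions: list[str],
    complete: Callable[[str], str],      # hypothetical wrapper around an LLM call
    num_seeds: int = 3,                  # instructions sampled as in-context examples
    num_new: int = 10,                   # synthetic instructions requested per call
    rng: Optional[random.Random] = None,
) -> list[dict[str, str]]:
    """Return a list of {"instruction", "trajectory"} records."""
    rng = rng or random.Random()

    # Phase 1: show a few sampled instructions and ask for new ones.
    seeds = rng.sample(seed_instructions, num_seeds)
    phase1_prompt = (
        "Here are example navigation instructions:\n"
        + "\n".join(f"- {s}" for s in seeds)
        + f"\nWrite {num_new} new navigation instructions, one per line."
    )
    reply = complete(phase1_prompt)
    # Assumed reply format: one instruction per line, optionally bulleted.
    new_instructions = [
        line.lstrip("-• ").strip() for line in reply.splitlines() if line.strip()
    ][:num_new]

    # Phase 2: ask for a trajectory that fulfills each generated instruction.
    records = []
    for instruction in new_instructions:
        phase2_prompt = (
            f"Instruction: {instruction}\n"
            "Write a step-by-step trajectory, described in text, that fulfills it."
        )
        records.append(
            {"instruction": instruction, "trajectory": complete(phase2_prompt)}
        )
    return records
```

Passing the model call in as a plain callable keeps the sketch independent of any particular API client and makes it easy to substitute a stub when testing.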