Structured Observation Language for Efficient and Generalizable Vision-Language Navigation

Daojie Peng, Fulong Ma, Jun Ma

Abstract

Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of the visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide each RGB-D image into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this text with the language instruction as a pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
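To make the observation-to-text conversion concrete, the following is a minimal sketch of how a structured observation string might be produced, assuming per-pixel semantic labels from an off-the-shelf segmentation model. The helper name, cell token format, and default grid size are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def structured_observation(rgb, depth, labels, n=4):
    """Serialize an RGB-D observation into a structured text description.

    rgb:    (H, W, 3) uint8 image
    depth:  (H, W) depth map in meters
    labels: (H, W) per-pixel semantic class names (e.g., from an
            off-the-shelf segmentation model) -- an assumption here
    n:      grid resolution (the N in the paper's N×N grid)
    """
    h, w = depth.shape
    cells = []
    for i in range(n):
        for j in range(n):
            ys = slice(i * h // n, (i + 1) * h // n)
            xs = slice(j * w // n, (j + 1) * w // n)
            # Representative semantics: most frequent class in the cell.
            names, counts = np.unique(labels[ys, xs], return_counts=True)
            semantic = names[counts.argmax()]
            # Representative color: mean RGB of the cell (a real system
            # might map this triple to a coarse color name).
            r, g, b = rgb[ys, xs].reshape(-1, 3).mean(axis=0)
            # Representative depth: median distance in meters.
            dist = float(np.median(depth[ys, xs]))
            cells.append(f"[{i},{j}] {semantic}, "
                         f"rgb({r:.0f},{g:.0f},{b:.0f}), {dist:.1f}m")
    return "; ".join(cells)
```

For a 4×4 grid this yields 16 short cell tokens, so the observation occupies a small, fixed token budget regardless of the input image resolution.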

Figures (3)

  • Figure 1: Pipeline of SOL-Nav. RGB-D observations are converted into structured textual descriptions with 2×2/4×4/6×6 multi-resolution grids (long/short-term history, current observation) encoding depth, semantic, and color information. The structured observation sequence, navigation instruction, and system description form a pure language prompt, which is input to an LLM to predict a block of consecutive actions for the agent.
  • Figure 2: Structured Observation Language Prompt for the LLM. The prompt integrates the system description ($D_{\text{system}}$), structured observation ($O_{\text{structure}}$), and task instruction ($I_{\text{task}}$) to provide a clear system definition, structured observations, and explicit prediction requirements for the language model (a hypothetical assembly sketch follows this list).
  • Figure 3: Real-world Deployments. We conduct real-world navigation experiments in three distinct scenarios with varying environmental characteristics (Tea Area, Hall Stairs, Meeting Room) to comprehensively evaluate the robustness and generalization of SOL-Nav.
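Following the layout that Figure 2 describes, a hypothetical assembly of the pure language prompt could look like the sketch below; the section markers and closing directive are illustrative, not the paper's exact template:

```python
def build_prompt(d_system: str, o_structure: str, i_task: str) -> str:
    """Concatenate the three prompt components named in Figure 2:
    system description, structured observation, and task instruction.
    The section headers are illustrative, not the paper's template."""
    return (
        f"### System\n{d_system}\n\n"
        f"### Observation\n{o_structure}\n\n"
        f"### Instruction\n{i_task}\n\n"
        "Predict the next consecutive block of navigation actions."
    )
```

Here `o_structure` would hold the serialized grids from the earlier sketch, covering the long/short-term history and the current observation as in Figure 1.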