
PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation

Zehua Fan, Wenqi Lyu, Wenxuan Song, Linge Zhao, Yifei Yang, Xi Wang, Junjie He, Lida Huang, Haiyan Liu, Bingchuan Sun, Guangjun Bao, Xuanyao Mao, Liang Xu, Yan Wang, Feng Gao

TL;DR

PROSPECT is a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. It uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention.

Abstract

Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal representations without inference overhead. Experiments on VLN-CE benchmarks and real-robot deployment demonstrate state-of-the-art performance and improved long-horizon robustness under diverse lighting. We will release code for the community soon.
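
The latent supervision described above can be illustrated with a minimal sketch. The Figure 2 caption states that next-step latents are predicted under a cosine objective (2D) and an MSE objective (3D) against frozen teachers; everything else here is an assumption made for illustration, including the function names, the `(tokens, dim)` array shapes, the per-token averaging, and the weights `w2d` and `w3d`. It is not the authors' implementation.

```python
import numpy as np

def cosine_latent_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; inputs have shape (tokens, dim)."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def mse_latent_loss(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))

def predictive_loss(pred_2d, teacher_2d, pred_3d, teacher_3d, w2d=1.0, w3d=1.0):
    """Weighted sum of the 2D (cosine) and 3D (MSE) latent losses.

    teacher_2d / teacher_3d stand in for next-step features from the frozen
    SigLIP and CUT3R teachers; the weights are hypothetical.
    """
    return (w2d * cosine_latent_loss(pred_2d, teacher_2d)
            + w3d * mse_latent_loss(pred_3d, teacher_3d))
```

Because the teachers are frozen, the targets would be treated as constants in a real training loop, so gradients flow only into the query tokens, the decoders, and the shared backbone.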

Paper Structure (24 sections, 12 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Overview of PROSPECT. (a) Streaming setup: A streaming attention mask enforces temporal causality and isolates 2D/3D query tokens to prevent cross-modal leakage. SigLIP and CUT3R provide 2D semantic and absolute-scale 3D spatial feature streams, fused by cross-attention for the policy. (b) Unified model: In training, stream query tokens predict next-step 2D/3D latent features under frozen SigLIP/CUT3R supervision (no inference cost). At inference, only the VLA policy runs at $\sim$4 Hz. (c) Results: First-tier VLN-CE performance and zero-shot Habitat navigation; larger gains on the long-horizon RxR benchmark than on R2R, indicating stronger robustness for complex instruction following. Real-robot deployment is robust under diverse lighting.
  • Figure 2: Architecture of PROSPECT. Instruction and observations (historical keyframes and current frame) share one pipeline: frozen SigLIP and CUT3R with cross-attention fusion; keyframes are condensed into long-term memory $M$. The model uses a KV cache for context and autoregressively outputs navigation actions. Training only: 2D/3D query tokens reverse-query the stream; lightweight decoders predict next-step latents under cosine (2D) and MSE (3D) losses against frozen teachers. The predictive branch is removed at inference.
  • Figure 3: Streaming attention mask used by PROSPECT. Upper (gray): Causal mask for navigation context (ctxt) and actions (act): each $\text{act}_i$ may attend only to $\text{ctxt}_{0:i}$ and $\text{act}_{0:i-1}$, ensuring no future leakage. Middle (red): Each 2D query token $\langle\text{Query2d}_i\rangle$ attends only to its own round and prior rounds' ctxt/act; it cannot attend to any other Query2d, any Query3d, or future rounds, which enforces both turn isolation and modality disentanglement. Lower (blue): Same for 3D query tokens $\langle\text{Query3d}_i\rangle$.
  • Figure 4: First-person views from ARX-Lift2 under diverse indoor/outdoor lighting.
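
The masking rule in the Figure 3 caption can be sketched as a boolean matrix. This is an illustrative reading of the caption, not the paper's code. The token layout (each round laid out as ctxt, act, Query2d, Query3d), the segment lengths, the helper names, and the choice to let every token attend to itself are all assumptions.

```python
import numpy as np

CTXT, ACT, Q2D, Q3D = "ctxt", "act", "q2d", "q3d"

def build_layout(num_rounds, ctxt_len=2, act_len=1):
    """Return one (round, kind) label per token for a hypothetical layout."""
    layout = []
    for r in range(num_rounds):
        layout += [(r, CTXT)] * ctxt_len + [(r, ACT)] * act_len
        layout += [(r, Q2D), (r, Q3D)]
    return layout

def streaming_mask(layout):
    """Boolean mask: mask[i, j] is True when token i may attend to token j."""
    n = len(layout)
    is_query = np.array([kind in (Q2D, Q3D) for _, kind in layout])
    idx = np.arange(n)
    # Temporal causality: a token never sees a later position.
    causal = idx[None, :] <= idx[:, None]
    # Query tokens are hidden as keys from every token except themselves,
    # so no query sees another query and ctxt/act never see any query.
    visible_key = ~is_query[None, :] | (idx[None, :] == idx[:, None])
    return causal & visible_key
```

With `build_layout(2)`, the second round's action token can see both rounds' context and the first action but neither first-round query, and each second-round query sees all context and action tokens up to its own round plus itself. Under this layout the whole rule reduces to "causal, and query keys are visible only to themselves", which is why no per-round bookkeeping is needed.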