Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu

TL;DR

This work addresses the limitations of end-to-end vision-language navigation by introducing DualVLN, a dual-system foundation model that decouples high-level semantic reasoning from low-level motion control. System 2 provides slow, robust pixel-goal grounding and mid-term waypoint planning, while System 1 rapidly converts those goals into smooth, obstacle-aware trajectories through a diffusion-based policy, connected via learnable latent queries. The approach achieves state-of-the-art results on VLN-CE and VLN-PE benchmarks, introduces the Social-VLN benchmark for social awareness, and demonstrates robust real-world performance across multiple robotic platforms. The combination of explicit pixel-goals and latent goal representations, together with asynchronous inference, offers improved generalization, interpretability, and real-time adaptability for embodied navigation in dynamic environments.

Abstract

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.

Paper Structure

This paper contains 30 sections, 3 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The proposed dual-system framework decouples high-level reasoning from low-level control. System 2 (slow, 2 Hz) uses a 7B pretrained VLM to generate a pixel goal and a latent goal, while System 1 (fast, 30 Hz) is a lightweight diffusion-based policy that converts these goals into smooth trajectories using high-frequency RGB inputs. Asynchronous inference enables a continuous, smooth navigation process. DualVLN sets a new state of the art on VLN-CE and VLN-PE and shows strong generalization in real-world deployments.
  • Figure 2: Overview of DualVLN. System 2 takes as input a sequence of egocentric images and the instruction, and predicts either view-adjustment actions or a 2D pixel coordinate within the image for the next navigation waypoint. System 1 then takes both the latent goal embeddings and high-frequency RGB inputs and generates continuous trajectories for the robot to follow through a diffusion-based policy.
  • Figure 3: Typical robot-humanoid interactions that pose key challenges to the robot's human-aware obstacle avoidance capabilities, including not only situations with a single agent but also cases involving multiple humanoids simultaneously.
  • Figure 4: Qualitative Results of Social-VLN Experiments.
  • Figure 5: Evaluation Metrics of Real-World Experiments.
  • ...and 6 more figures
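
The rate mismatch described in the Figure 1 caption (System 2 at 2 Hz, System 1 at 30 Hz) can be illustrated with a minimal sketch. This is not the authors' implementation: `slow_planner`, `fast_policy`, `Goal`, and the dummy pixel values are hypothetical stand-ins, and the loop is a single-threaded simulation of the update schedule rather than true asynchronous inference. Only the two rates come from the caption.

```python
from dataclasses import dataclass

SYSTEM2_HZ = 2    # slow planner rate, taken from the Figure 1 caption
SYSTEM1_HZ = 30   # fast policy rate, taken from the Figure 1 caption
TICKS_PER_PLAN = SYSTEM1_HZ // SYSTEM2_HZ  # 15 fast ticks per slow update


@dataclass
class Goal:
    pixel: tuple   # hypothetical (u, v) pixel goal
    version: int   # how many slow-planner updates preceded this goal


def slow_planner(tick: int) -> Goal:
    """Hypothetical stand-in for System 2: returns a dummy pixel goal."""
    version = tick // TICKS_PER_PLAN
    return Goal(pixel=(320 + version, 240), version=version)


def fast_policy(goal: Goal, tick: int) -> tuple:
    """Hypothetical stand-in for System 1: tags each command with the goal it used."""
    return (tick, goal.version)


def run(num_ticks: int) -> list:
    """Step the fast loop every tick; refresh the goal every TICKS_PER_PLAN ticks."""
    commands = []
    goal = None
    for tick in range(num_ticks):
        if tick % TICKS_PER_PLAN == 0:
            goal = slow_planner(tick)  # slow update: new goal for the fast loop
        commands.append(fast_policy(goal, tick))  # fast update: reuse latest goal
    return commands


commands = run(30)  # one simulated second at 30 Hz
```

In this sketch the fast loop issues 15 commands per goal, so one simulated second uses two goals. A real deployment would presumably run the two systems in separate processes or threads, with the fast policy reading whichever goal was most recently published.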