Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li

TL;DR

Beyond-the-View Navigation (BVN) demands autonomous long-horizon planning to reach distant targets with minimal guidance. The authors introduce SparseVideoNav, a navigation system based on sparse video generation that replaces dense, step-by-step instructions with long-horizon foresight: it predicts a sparse future and infers actions from it, and is trained through a four-stage pipeline. Built on 140 hours of real-world data, SparseVideoNav achieves sub-second trajectory inference over a 20-second horizon and up to a 27-fold speed-up, while outperforming state-of-the-art LLM baselines on BVN in diverse real-world settings, including challenging night scenes. This work demonstrates the viability of video-generation priors for embodied AI, offering a practical path toward scalable BVN and outlining directions for scaling data and further accelerating inference.

Abstract

Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.

Paper Structure

This paper contains 23 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: In this work, we investigate the beyond-the-view navigation task in the real world, where agents must locate distant, unseen targets without step-by-step guidance. Traditional large language model-based methods suffer from short-horizon supervision, leading to short-sighted behaviors, e.g., unexpected turning and dead-end trapping. We address this challenge from a new perspective by introducing the video generation model to this field for the first time. The whole training pipeline is further sparsified to extend the prediction horizon and improve computational efficiency.
  • Figure 2: Architecture and four-stage training pipeline of SparseVideoNav. (Top) The overall training architecture. The current observation, historical observations, and the language instruction are fed into the video generation model (VGM) backbone to generate future sparse video latents. A DiT-based action head then predicts continuous actions conditioned on the generated sparse future and the language instruction. (Bottom) The four-stage training pipeline: Stage 1 adapts text-to-video (T2V) to image-to-video (I2V); Stage 2 injects history into the I2V backbone; Stage 3 distills the backbone to reduce denoising steps; Stage 4 learns actions based on the generated sparse future. Components not used in a given stage are shown as gray blocks.
  • Figure 3: Qualitative comparison of different sparse intervals. With a sparse interval of 3, the model successfully imagines a path towards the beyond-the-view target while maintaining visual fidelity.
  • Figure 4: Qualitative results of zero-shot beyond-the-view navigation in challenging, unstructured environments. SparseVideoNav successfully navigates through challenging scenarios, including dead ends, a narrow accessible ramp, and a hillside with high inclination angles.
  • Figure 5: Analysis of video generation results of SparseVideoNav during zero-shot deployment in beyond-the-view navigation.
  • ...and 8 more figures
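
To make the idea of a "sparse future" in the captions above more concrete, the following is a minimal sketch of how a sparse set of future frames could be selected over a fixed horizon. It is an illustration under stated assumptions, not the authors' implementation: the function name `sparse_frame_indices`, the frame rate of 4 frames per second, and the reading of "sparse interval" as "keep every k-th future frame" are all hypothetical. Only the 20-second horizon and the interval value 3 come from the text above.

```python
def sparse_frame_indices(horizon_s: float, fps: float, interval: int) -> list[int]:
    """Return the future frame offsets kept when sampling every `interval`-th frame.

    Assumption (not from the paper): "sparse interval" is interpreted as a
    simple temporal stride over a uniformly sampled future.
    """
    if horizon_s <= 0 or fps <= 0 or interval < 1:
        raise ValueError("horizon_s and fps must be positive, interval >= 1")
    # Total number of future frames a dense generator would have to produce.
    total_frames = int(round(horizon_s * fps))
    # Keep frame offsets interval, 2*interval, ... up to the horizon.
    return list(range(interval, total_frames + 1, interval))


# Hypothetical setting: 20-second horizon, 4 frames per second, interval 3.
dense = sparse_frame_indices(20, 4, 1)
sparse = sparse_frame_indices(20, 4, 3)
print(len(dense), len(sparse))
```

Under these assumed numbers the sparse schedule covers the same 20-second horizon with roughly a third of the frames, which is the intuition behind trading temporal density for horizon length. The speed-up reported in the abstract also depends on other factors mentioned in the captions, such as distillation to fewer denoising steps, so this sketch should not be read as reproducing that figure.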