VISTAv2: World Imagination for Indoor Vision-and-Language Navigation

Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu

TL;DR

VISTAv2 introduces a test-time, action-conditioned generative world model that imagines short-horizon egocentric futures conditioned on instructions and candidate actions, then converts these futures into an online egocentric value map. This imagined value is fused at score level with a standard frontier-based planner, preserving the planner while injecting geometry-aware, reachability-guided cues. Through an Imagination-to-Value head and a diffusion-based world model operating in latent space, VISTAv2 achieves consistent improvements in SR and SPL on VLN benchmarks (R2R and RoboTHOR) and demonstrates the importance of action-conditioned imagination and map-space value fusion over semantic priors alone. The approach remains efficient, interpretable, and deployable as a plug-in to existing planners, offering a practical pathway for robust embodied navigation with generative world models.

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow language instructions while acting in continuous real-world spaces. Prior image-imagination-based VLN work shows benefits for discrete panoramas but lacks online, action-conditioned predictions and does not produce explicit planning values; moreover, many methods replace the planner with long-horizon objectives that are brittle and slow. To bridge this gap, we propose VISTAv2, a generative world model that rolls out egocentric future views conditioned on past observations, candidate action sequences, and instructions, and projects them into an online value map for planning. Unlike prior approaches, VISTAv2 does not replace the planner. The online value map is fused at score level with the base objective, providing reachability- and risk-aware guidance. Concretely, we employ an action-aware Conditional Diffusion Transformer video predictor to synthesize short-horizon futures, align them with the natural language instruction via a vision-language scorer, and fuse multiple rollouts in a differentiable imagination-to-value head to output an imagined egocentric value map. For efficiency, rollouts occur in VAE latent space with a distilled sampler and sparse decoding, enabling inference on a single consumer GPU. Evaluated on MP3D and RoboTHOR, VISTAv2 improves over strong baselines, and ablations show that action-conditioned imagination, instruction-guided value fusion, and the online value-map planner are all critical, suggesting that VISTAv2 offers a practical and interpretable route to robust VLN.
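The score-level fusion described above can be illustrated with a minimal sketch. It assumes a simple weighted sum of the planner's native score, a language prior, and the imagined value. The actual fusion equation is not reproduced in this summary, so the `Candidate` fields, the weights `w_prior` and `w_imag`, and the additive form are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A frontier-based candidate trajectory (all fields are hypothetical)."""
    name: str              # illustrative identifier
    base_score: float      # planner's native score, higher is better
    prior_value: float     # language prior read off the map, assumed in [0, 1]
    imagined_value: float  # imagined value along the trajectory, assumed in [0, 1]


def fused_score(c, w_prior=0.5, w_imag=1.0):
    # Assumed fusion rule: a weighted sum. The planner's own score is kept
    # as a term, so the planner is augmented rather than replaced.
    return c.base_score + w_prior * c.prior_value + w_imag * c.imagined_value


def rank_candidates(cands, w_prior=0.5, w_imag=1.0):
    # Sort candidates from best to worst under the fused score.
    return sorted(cands, key=lambda c: fused_score(c, w_prior, w_imag), reverse=True)


if __name__ == "__main__":
    cands = [
        Candidate("hallway", base_score=1.0, prior_value=0.2, imagined_value=0.1),
        Candidate("doorway", base_score=0.8, prior_value=0.4, imagined_value=0.9),
    ]
    for c in rank_candidates(cands):
        print(c.name, round(fused_score(c), 3))
```

Setting both weights to zero recovers the base planner's ranking, which is one way to read the claim that the value map works as a plug-in rather than a replacement.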

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure 1: VISTAv2 pipeline overview (§3.1). From a language instruction and observations (RGB, depth, odometry), the agent: (1) builds a local map and proposes frontier-based candidate trajectories; (2) forms a language prior over the map (Value); (3) uses the world model to imagine short-horizon futures and converts them into an egocentric imagined value map; (4) fuses imagined value and the prior with the planner’s native score to rank candidates (the final fusion equation) and executes the first control in a receding-horizon loop.
  • Figure 2: World Model (§3.2). Given the recent egocentric frames $I_{t-m+1:t}$, the instruction $g$, and a candidate trajectory $A$, we integrate poses to obtain $C(A) \in SE(2)^H$ and feed $(x_t, g, C(A))$ to the action-conditioned video diffusion model $\mathcal{W}_\theta$ (CDiT in VAE latent space). $\mathcal{W}_\theta$ produces a short-horizon egocentric rollout $\{\hat{I}_{t+\tau}\}_{\tau=1}^{H}$; only a stride-$\Delta t$ subset is decoded for downstream I2V scoring (§3.3).
  • Figure 3: Effect of visual imagination on goal discovery. Each panel shows one episode (same start/goal). Left (No Imagination): the base planner explores many frontiers (red circles) guided only by occupancy/prior; the value over the explored region is diffuse (green mask), and the agent wanders (132 steps) and fails to localize the TV. Right (With Imagination): our world model rolls out egocentric futures and the I2V head produces a fan-shaped image value map (orange/yellow), which, fused with the prior, sharpens the score and steers the agent through the doorway to the TV room, succeeding in 30 steps. Top: current depth/RGB and occupancy with path; Right column: value maps; Bottom: frontier set and fused score along the chosen path.
  • Figure 4: Qualitative visualization of the world-model rollout for two trajectories in MP3D and HM3D. The rollouts capture room layout and semantics (doorways, arches, windows, tables, and bookshelves), which are sufficient for planning even when textures appear stylized.
  • Figure 5: Uncertainty gating sweep on R2R (Val-Unseen). SR/SPL versus the gating threshold $\theta$ (right axis: fallback rate). Performance peaks at a moderate threshold, $\theta = 0.6$.
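The pose-integration and sparse-decoding steps named in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each action is a hypothetical `(forward_m, turn_rad)` pair applied turn-first, and that the stride-$\Delta t$ subset consists of rollout steps $\tau = \Delta t, 2\Delta t, \dots \le H$. Neither convention is stated in the caption.

```python
import math


def integrate_poses(start, actions):
    """Integrate a candidate action sequence into H poses in SE(2).

    start:   (x, y, yaw) of the agent at time t
    actions: list of (forward_m, turn_rad) pairs (assumed action format)
    Returns a list of (x, y, yaw) tuples, one per action.
    """
    x, y, yaw = start
    poses = []
    for forward, turn in actions:
        # Assumed convention: rotate first, then translate along the new heading.
        yaw = (yaw + turn + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        x += forward * math.cos(yaw)
        y += forward * math.sin(yaw)
        poses.append((x, y, yaw))
    return poses


def decode_indices(horizon, stride):
    """Rollout steps tau in 1..horizon that would be decoded to pixels."""
    return list(range(stride, horizon + 1, stride))


if __name__ == "__main__":
    actions = [(1.0, 0.0), (0.0, math.pi / 2), (1.0, 0.0)]
    for pose in integrate_poses((0.0, 0.0, 0.0), actions):
        print(tuple(round(v, 3) for v in pose))
    print(decode_indices(horizon=8, stride=2))
```

Decoding only every `stride`-th latent frame keeps the number of VAE decoder calls at roughly `horizon / stride` per rollout, consistent with the efficiency motivation given in the abstract.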