VISTAv2: World Imagination for Indoor Vision-and-Language Navigation
Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu
TL;DR
VISTAv2 introduces a test-time, action-conditioned generative world model that imagines short-horizon egocentric futures conditioned on instructions and candidate actions, then converts these futures into an online egocentric value map. This imagined value is fused at score level with a standard frontier-based planner, preserving the planner while injecting geometry-aware, reachability-guided cues. Through an Imagination-to-Value head and a diffusion-based world model operating in latent space, VISTAv2 achieves consistent improvements in success rate (SR) and success weighted by path length (SPL) on VLN benchmarks (R2R and RoboTHOR) and demonstrates the importance of action-conditioned imagination and map-space value fusion over semantic priors alone. The approach remains efficient, interpretable, and deployable as a plug-in to existing planners, offering a practical pathway for robust embodied navigation with generative world models.
Abstract
Vision-and-Language Navigation (VLN) requires agents to follow language instructions while acting in continuous real-world spaces. Prior image-imagination-based VLN work shows benefits for discrete panoramas but lacks online, action-conditioned predictions and does not produce explicit planning values; moreover, many methods replace the planner with long-horizon objectives that are brittle and slow. To bridge this gap, we propose VISTAv2, a generative world model that rolls out egocentric future views conditioned on past observations, candidate action sequences, and instructions, and projects them into an online value map for planning. Unlike prior approaches, VISTAv2 does not replace the planner: the online value map is fused at score level with the base objective, providing reachability- and risk-aware guidance. Concretely, we employ an action-aware Conditional Diffusion Transformer video predictor to synthesize short-horizon futures, align them with the natural language instruction via a vision-language scorer, and fuse multiple rollouts in a differentiable imagination-to-value head to output an imagined egocentric value map. For efficiency, rollouts occur in VAE latent space with a distilled sampler and sparse decoding, enabling inference on a single consumer GPU. Evaluated on MP3D and RoboTHOR, VISTAv2 improves over strong baselines. Ablations show that action-conditioned imagination, instruction-guided value fusion, and the online value-map planner are all critical, suggesting that VISTAv2 offers a practical and interpretable route to robust VLN.
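
To make the idea of score-level fusion concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes each imagined rollout has already been reduced to a grid cell and a scalar instruction-alignment score. It aggregates rollouts with a softmax weighting, which is an assumed stand-in for the learned imagination-to-value head, and adds the resulting value to the base planner's frontier scores. The function names, the sparse dictionary map, the `temperature` parameter, and the fusion weight `lam` are all hypothetical.

```python
import math
from collections import defaultdict


def imagined_value_map(rollouts, temperature=1.0):
    """Aggregate imagined rollouts into a sparse egocentric value map.

    rollouts: list of (cell, alignment) pairs. `cell` is an (x, y) grid
    index where the imagined trajectory ends, and `alignment` is a
    hypothetical instruction-alignment score for that rollout.
    """
    if not rollouts:
        return {}
    # Softmax weights over rollouts (assumed aggregation rule), shifted
    # by the maximum score for numerical stability.
    peak = max(score for _, score in rollouts)
    weights = [math.exp((score - peak) / temperature) for _, score in rollouts]
    total = sum(weights)
    value = defaultdict(float)
    for (cell, score), weight in zip(rollouts, weights):
        value[cell] += (weight / total) * score
    return dict(value)


def fuse_scores(frontiers, value_map, lam=0.5):
    """Score-level fusion: base planner score plus weighted imagined value.

    frontiers: dict mapping a frontier cell to the base planner's score.
    Cells with no imagined value keep their base score unchanged.
    """
    return {
        cell: base + lam * value_map.get(cell, 0.0)
        for cell, base in frontiers.items()
    }


def select_frontier(frontiers, value_map, lam=0.5):
    """Pick the frontier cell with the highest fused score."""
    fused = fuse_scores(frontiers, value_map, lam)
    return max(fused, key=fused.get)
```

Because the fusion is additive at the score level, setting `lam` to zero recovers the base planner's choice exactly, which mirrors the plug-in property described above: the planner is augmented rather than replaced.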
