Language-Conditioned World Modeling for Visual Navigation

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng

Abstract

We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.
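
To make the open-loop formulation concrete, the sketch below shows how a rollout of the LCVN-WM + LCVN-AC family proceeds from a single egocentric frame and an instruction, with no further environment feedback. It is a minimal Python illustration under assumed interfaces: `encode_image`, `encode_text`, `world_model` (standing in for LCVN-WM), and `policy` (standing in for LCVN-AC) are hypothetical callables, not the released implementation.

```python
import torch

@torch.no_grad()
def open_loop_rollout(o0, instruction, encode_image, encode_text,
                      world_model, policy, horizon=8):
    """Illustrative open-loop rollout; all callables are hypothetical stand-ins.

    o0          -- initial egocentric observation (image tensor), the only real frame
    instruction -- natural-language instruction (string)
    world_model -- stands in for LCVN-WM: predicts the next latent state
    policy      -- stands in for LCVN-AC: predicts the next action
    """
    s = encode_image(o0)             # latent state s_0 from the initial observation
    z = encode_text(instruction)     # instruction embedding I
    actions, latents = [], [s]
    for _ in range(horizon):
        a = policy(s, z)             # action conditioned on current latent + instruction
        s = world_model(s, a, z)     # imagined next latent; no environment feedback
        actions.append(a)
        latents.append(s)
    return actions, latents          # the entire predicted future trajectory
```

The defining property of the task is visible in the loop: after the initial frame, every state the policy conditions on is imagined by the world model rather than observed from the environment.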

Paper Structure

This paper contains 22 sections, 17 equations, 6 figures, 10 tables, and 3 algorithms.

Figures (6)

  • Figure 1: Language-Conditioned Visual Navigation (LCVN). Given only an initial egocentric observation and a language instruction, the agent generates the entire future trajectory without environmental feedback, imagining intermediate states (➀–➃) along the described route. LCVN-WM performs language- and action-conditioned latent rollouts, LCVN-AC selects latent actions via intrinsic rewards, and LCVN-Uni autoregressively predicts both the next action and observation.
  • Figure 2: Language-conditioned world model (LCVN-WM) and associated agent (LCVN-AC). Training Phase 1: LCVN-WM is trained with Diffusion Forcing (chen2024diffusion; song2025history) to predict future latent observations $\hat{s}_{t+1}$ (Eq. \ref{eq:LCVN-wm}) from noisy context latents at independent noise levels (Eq. \ref{eq:df}), conditioned on actions $\hat{a}_t$, instruction $I$, time shift $t_s$, and diffusion timestep; a minimal sketch of this noising scheme follows the figure list. Training Phase 2: LCVN-AC is trained in LCVN-WM's latent space, aligning expert and learner plans via KL divergence (Eq. \ref{eq:kl}). Actor--critic optimization uses intrinsic rewards measuring agreement between predicted and expert latent rollouts (Eqs. \ref{eq:ac_reward_wm}--\ref{eq:ac-actor}). Inference Stage: Given latent states $\hat{\mathbf{s}}_t$ and instruction $I$, LCVN-WM predicts $\hat{s}_{t+1}$ (Eq. \ref{eq:LCVN-wm}) and LCVN-AC generates $\hat{a}_{t+1}$ conditioned on the predicted latent and the instruction embedding (Eq. \ref{eq:LCVN-ac}).
  • Figure 3: LCVN-Uni architecture. LCVN-Uni unifies navigation planning and world modeling within an autoregressive MLLM backbone. Actions $\hat{a}_t$, instructions $I$, and observations $o_s, \hat{o}_t$ are tokenized by bin, BPE, and VQ tokenizers, respectively, then fused into a unified sequence for joint modeling (a minimal token-fusion sketch follows the figure list). In a single forward pass, the agent predicts both the next action and the next observation, trained under a combined objective balancing planning and imagination. During inference, $\hat{o}_t$ is the observation predicted by the model at the previous step, not from the environment.
  • Figure 4: Qualitative comparisons on the LCVN val-seen split between NWM + LCVN-AC and the LCVN agents. LCVN agents exhibit stronger sensitivity to directional changes and are less prone to losing key landmarks during world modeling.
  • Figure 5: Qualitative comparisons of language guidance, showing that both LCVN-Uni and LCVN-WM better preserve semantic consistency across states when guided by language.
  • ...and 1 more figure
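
As referenced in the Figure 2 caption, the sketch below illustrates the Diffusion Forcing-style noising used to train LCVN-WM: each context frame receives an independently sampled noise level, and the world model predicts the clean latents from the noisy context, conditioned on actions, instruction, time shift, and the per-frame diffusion timestep. It is a minimal sketch under standard DDPM-style assumptions; `lcvn_wm`, the argument names, and the simple x0-prediction loss are placeholders for the paper's actual objective (Eq. \ref{eq:df}).

```python
import torch
import torch.nn.functional as F

def diffusion_forcing_step(lcvn_wm, latents, actions, instr_emb, time_shift,
                           alphas_cumprod, num_steps=1000):
    """One illustrative training step in the spirit of Diffusion Forcing.

    latents   -- (B, T, D) clean latent observations
    actions   -- (B, T, A) action sequence
    instr_emb -- (B, E)    instruction embedding I
    All module and argument names are hypothetical, not the released code.
    """
    B, T, _ = latents.shape
    # Key idea: every frame in the context gets its own, independently
    # sampled diffusion timestep (noise level).
    k = torch.randint(0, num_steps, (B, T), device=latents.device)
    abar = alphas_cumprod[k].unsqueeze(-1)                   # (B, T, 1)
    noise = torch.randn_like(latents)
    noisy = abar.sqrt() * latents + (1.0 - abar).sqrt() * noise

    # The world model denoises/predicts future latents from the noisy context,
    # conditioned on actions, instruction, time shift, and per-frame timesteps.
    pred = lcvn_wm(noisy, actions, instr_emb, time_shift, k)
    return F.mse_loss(pred, latents)   # simple x0-prediction loss as a stand-in
```

Per the cited Diffusion Forcing work, sampling noise levels independently per frame, rather than one level for the whole sequence, is what lets the model treat partially noised history as context and roll out stably at inference time.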
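
Likewise, for the Figure 3 caption, the sketch below shows one plausible way the three token streams could be fused into a single autoregressive sequence: actions discretized into bins, the instruction encoded as BPE ids, and observations represented by VQ code indices, each offset into a disjoint range of a shared vocabulary. The vocabulary sizes, offsets, and interleaving order here are assumptions for illustration, not the released tokenizer layout.

```python
import torch

def build_unified_sequence(actions, instruction_ids, obs_codes,
                           action_bins=256, text_vocab=32000):
    """Illustrative token fusion for an LCVN-Uni-style sequence (hypothetical layout).

    actions         -- (T, A) continuous actions in [-1, 1], discretized per dimension
    instruction_ids -- (L,)   BPE token ids of the instruction I
    obs_codes       -- (T, K) VQ code indices of the observations
    Returns a 1-D id sequence [instruction | a_1 | o_1 | a_2 | o_2 | ...] over a
    shared vocabulary, suitable for an autoregressive MLLM that predicts both
    the next action and the next observation.
    """
    # Bin tokenizer: map each action dimension to an integer bin id.
    a_tok = ((actions.clamp(-1, 1) + 1) / 2 * (action_bins - 1)).round().long()

    # Offset each modality into a disjoint id range of the shared vocabulary.
    text_tok = instruction_ids.long()
    a_tok = a_tok + text_vocab
    o_tok = obs_codes.long() + text_vocab + action_bins

    chunks = [text_tok]
    for t in range(a_tok.shape[0]):
        chunks.append(a_tok[t])    # action tokens for step t
        chunks.append(o_tok[t])    # observation tokens for step t
    return torch.cat(chunks)
```

At inference time, as the caption notes, the observation tokens for a step come from the model's own prediction at the previous step rather than from the environment, so `obs_codes` would be filled in autoregressively.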