RAE-NWM: Navigation World Model in Dense Visual Representation Space

Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng

TL;DR

This paper proposes the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. Modeling sequential rollouts in this space improves structural stability and action accuracy, which benefits downstream planning and navigation.

Abstract

Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
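
As a rough illustration of what a linear dynamics probe can look like, the sketch below fits a single linear map from the current features and the action to the future features, then reports a global $R^2$ pooled over all feature dimensions. This is only a minimal sketch under stated assumptions: the abstract does not specify the probe's exact form, so the function name `linear_probe_r2`, the bias column, the small `ridge` term, the in-sample evaluation, and the synthetic data are all hypothetical stand-ins rather than the authors' protocol.

```python
import numpy as np

def linear_probe_r2(z_now, actions, z_future, ridge=1e-6):
    """Fit z_future ~= [z_now, actions, 1] @ W and return the global R^2.

    Hypothetical probe: the paper's exact regression setup is not given here.
    """
    n = z_now.shape[0]
    # Design matrix: current features, action, and a bias column.
    x = np.concatenate([z_now, actions, np.ones((n, 1))], axis=1)
    # Ridge-stabilised normal equations (ridge is an assumed numerical safeguard).
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    w = np.linalg.solve(gram, x.T @ z_future)
    residual = z_future - x @ w
    # Global R^2: sums of squares are pooled across every feature dimension.
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((z_future - z_future.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for two representation spaces (not real DINOv2 or VAE features).
rng = np.random.default_rng(0)
n, d, a_dim = 512, 16, 3
z_now = rng.normal(size=(n, d))
actions = rng.normal(size=(n, a_dim))
a_mat = 0.3 * rng.normal(size=(d, d))
b_mat = rng.normal(size=(a_dim, d))

# Space A: transitions are linear in (state, action) up to small noise.
z_linear = z_now @ a_mat + actions @ b_mat + 0.01 * rng.normal(size=(n, d))
# Space B: transitions depend on (state, action) in a strongly nonlinear way.
z_nonlinear = np.tanh(3.0 * z_now @ a_mat) * np.sin(actions @ b_mat)

r2_linear = linear_probe_r2(z_now, actions, z_linear)
r2_nonlinear = linear_probe_r2(z_now, actions, z_nonlinear)
print(f"R^2 linear space: {r2_linear:.3f}, nonlinear space: {r2_nonlinear:.3f}")
```

The probe here is fitted and scored on the same samples for brevity; a faithful comparison would score on held-out transitions and repeat the fit for each prediction horizon $k$.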

Paper Structure (44 sections, 12 equations, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Long-horizon generation comparison. We compare sequential rollouts from the baseline VAE-based NWM versus our DINO-based RAE-NWM. The VAE-based model exhibits severe structural degradation at later horizons ($k = 12\,\mathrm{s}$ and $16\,\mathrm{s}$), whereas RAE-NWM maintains strong structural integrity throughout the sequence.
  • Figure 2: Linear predictability of action-conditioned state transitions across various visual representation spaces, measured by the global $R^2$ score over the prediction horizon step $k$.
  • Figure 3: Architecture of the proposed RAE-NWM. Our model encodes context frames into a sequence of tokens $\mathbf{z}_{\mathrm{cond},i}$ via a frozen DINOv2. The dynamics conditioning module then integrates the agent motion $\mathbf{a}_{i\rightarrow i+k}$ and prediction horizon $k$ with the flow time $t$ through a time-driven gating mechanism. Finally, the CDiT-DH backbone predicts the flow velocity $\mathbf{v}_\theta$.
  • Figure 4: Training and sequential rollout pipelines of RAE-NWM. During training, the generative network is optimized via a flow matching objective to predict the velocity field $\mathbf{v}_\theta$ matching the target velocity $\mathbf{u}^{(t)}$. During inference, an ordinary differential equation (ODE) solver sequentially generates future states $\{\hat{\mathbf{z}}_{i+k}, \dots\}$ along a given action sequence $\{(\mathbf{a}_{i\rightarrow i+k}, k), \dots\}$ within a closed representation-space rollout loop. The frozen RAE decoder is applied exclusively for final pixel-level visualization.
  • Figure 5: Qualitative results of long-horizon generation. (a) Direct prediction at the 16-second horizon. NWM produces observations that completely deviate from the ground truth, whereas RAE-NWM maintains high geometric fidelity, demonstrating superior action-conditioned dynamics. (b) Sequential rollouts of RAE-NWM on the RECON and SCAND datasets, exhibiting strong structural consistency and spatial stability over extended horizons.
  • ...and 12 more figures
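
The training objective and closed rollout loop described in the Figure 4 caption can be sketched roughly as follows. This is a toy illustration under several assumptions that the captions do not confirm: a linear interpolation path between noise and data (so the target velocity is data minus noise), a fixed-step Euler integrator as the ODE solver, and a hand-written `oracle_velocity` that stands in for the learned network $\mathbf{v}_\theta$. All function names, and the choice to give actions the same dimension as the latent, are hypothetical.

```python
import numpy as np

def oracle_velocity(z_t, t, z_context, action, k):
    """Toy stand-in for the learned velocity field (hypothetical dynamics).

    It transports any sample toward z_context + k * action, which plays the
    role of the "true" future state in this sketch.
    """
    return (z_context + k * action - z_t) / (1.0 - t)

def flow_matching_loss(velocity_fn, z_target, z_context, action, k, rng):
    """Monte Carlo estimate of a flow matching loss on one batch."""
    noise = rng.normal(size=z_target.shape)
    t = rng.uniform(size=(z_target.shape[0], 1))   # flow time in [0, 1)
    z_t = (1.0 - t) * noise + t * z_target         # assumed linear path
    u_t = z_target - noise                         # target velocity for that path
    v = velocity_fn(z_t, t, z_context, action, k)
    return np.mean((v - u_t) ** 2)

def euler_sample(velocity_fn, z_context, action, k, rng, num_steps=20):
    """Integrate the ODE from noise (t=0) to a predicted state (t=1)."""
    z = rng.normal(size=z_context.shape)
    dt = 1.0 / num_steps
    for n in range(num_steps):
        z = z + dt * velocity_fn(z, n * dt, z_context, action, k)
    return z

def rollout(velocity_fn, z_start, plan, rng):
    """Closed-loop rollout: each predicted state becomes the next context."""
    z, states = z_start, []
    for action, k in plan:
        z = euler_sample(velocity_fn, z, action, k, rng)
        states.append(z)
    return states

rng = np.random.default_rng(0)

# Rollout along a two-step plan of (action, horizon) pairs, entirely in latent space.
z0 = np.zeros(4)
plan = [(np.full(4, 0.5), 2), (np.full(4, -1.0), 1)]
states = rollout(oracle_velocity, z0, plan, rng)

# The oracle field matches the target velocity, so its loss is numerically zero.
ctx = rng.normal(size=(8, 4))
act = rng.normal(size=(8, 4))
loss = flow_matching_loss(oracle_velocity, ctx + 3 * act, ctx, act, 3, rng)
print(states[0], states[1], loss)
```

In this sketch no decoder appears inside the loop, mirroring the caption's point that decoding to pixels is reserved for final visualization; a real system would replace `oracle_velocity` with the trained network and would likely use a more accurate solver.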