Cross-View World Models

Rishabh Sharma, Gijs Hogervorst, Wayne E. Mackey, David J. Heeger, Stefano Martiniani

TL;DR

XVWM is a world model trained with a cross-view prediction objective, which pushes it toward view-invariant 3D representations and lets it imagine outcomes from multiple viewpoints rather than from the egocentric perspective alone. Trained on synchronized multi-view gameplay data, XVWM produces parallel imagination streams across viewpoints and demonstrates spatial grounding comparable to cognitive maps, as well as positive transfer to same-view predictions. Reported capabilities include sub-marker localization, trajectory consistency over long rollouts, and detailed egocentric predictions from a BEV marker, which point to its potential for planning and multi-agent perspective-taking. The work uses cross-view consistency as a geometric regularizer for self-supervised world modeling and offers a foundation for perspective-aware planning in complex environments.

Abstract

World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.
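
The cross-view objective described above can be pictured as a data-assembly step: context frames come from one viewpoint, and the target frame comes from the same or a different viewpoint at a later time step. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The names `CrossViewSample` and `make_sample` are hypothetical, strings stand in for image frames, and the default context length of 4 is taken from the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class CrossViewSample:
    input_view: str      # viewpoint the context frames come from
    target_view: str     # viewpoint the prediction is compared against
    context: List[str]   # context frames from input_view
    actions: List[str]   # actions between the last context frame and the target
    target: str          # ground-truth future frame from target_view


def make_sample(frames: Dict[str, Sequence[str]], actions: Sequence[str],
                input_view: str, target_view: str,
                start: int, context_len: int = 4, horizon: int = 1) -> CrossViewSample:
    """Build one (context, actions, target) example from synchronized views.

    Assumes every view has one frame per time step and that actions[t]
    is the action taken between step t and step t + 1.
    """
    last = start + context_len - 1   # index of the last context frame
    target_idx = last + horizon      # index of the frame to predict
    if target_idx >= len(frames[target_view]):
        raise IndexError("target frame lies beyond the end of the recording")
    return CrossViewSample(
        input_view=input_view,
        target_view=target_view,
        context=list(frames[input_view][start:last + 1]),
        actions=list(actions[last:target_idx]),
        target=frames[target_view][target_idx],
    )


# Toy synchronized recording with two views (hypothetical data).
frames = {v: [f"{v}_{t}" for t in range(30)] for v in ("ego", "bev")}
actions = [f"a_{t}" for t in range(30)]

# Cross-view example: bird's-eye context, egocentric target 20 steps ahead.
sample = make_sample(frames, actions, "bev", "ego", start=0, horizon=20)
print(sample.context, sample.target)
```

Setting `input_view` equal to `target_view` recovers the conventional same-view case. If frames are spaced $0.2$s apart, as the next-frame figure captions suggest, a 4-second target would sit 20 steps after the last context frame; whether the model reaches that target in one jump or by rolling out step by step is not specified here.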

Paper Structure (9 sections, 7 figures, 1 table)

This paper contains 9 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Gaining multiple imagination streams. (a) A conventional world model predicts future states from the same viewpoint. (b) An XVWM trained on egocentric and bird's-eye views learns bidirectional cross-view prediction: given either viewpoint as input, it can also imagine the future state from the other. These results yield a functional analog of biological cognitive maps: the model can locate and orient itself in its environment from visual cues alone. (c) Training on four viewpoints yields any-to-any prediction across all $16$ input-output view pairs. Given context from a single viewpoint, XVWM can predict action-conditioned future states from all available perspectives. All predictions shown are 4 seconds into the future with identical state and action conditioning. Only the last of the 4 context frames is shown as input.
  • Figure 2: Same-view prediction quality and transfer. (a) Ego$\to$ego perceptual similarity (LPIPS and DreamSim, lower is better) across the three models. The Single-View baseline sees $100\%$ egocentric pairs, the Two-View model $25\%$, and the Four-View model $12.5\%$. All models are trained with identical compute (same steps, epochs, and batch size). (b, c) Ego$\to$ego quality plotted against training exposure. Blue/orange curves show the Single-View model at intermediate checkpoints; starred and diamond markers show the fully trained Four-View and Two-View XVWMs at their effective exposures. Metrics are computed for a 4-second prediction target, averaged over $1{,}257$ test samples ($3$ predictions per test episode). Error bars show bootstrap $95\%$ confidence intervals.
  • Figure 3: Cross-view prediction quality (all$\to$ego) within and across XVWM models. The Two-View model, trained on complementary, high-signal views, outperforms the Four-View model. This is partly explained by the Two-View model's higher exposure ($25\%$ across all pairs) compared to the Four-View model ($12.5\%$ and $4.17\%$ for same-view and cross-view pairs, respectively). However, even within the Four-View model, OS and Front perform considerably worse than BEV at similar exposure levels. Metrics are computed for a 4-second prediction target, averaged over $1{,}257$ test samples ($3$ predictions per test episode). Error bars depict bootstrap $95\%$ confidence intervals.
  • Figure 4: XVWM's internal geolocalization. Localization in an environment is but one component of the mammalian cognitive map. Localization alone is insufficient for navigation; one also needs to encode consistent spatial ordering. Here, we probe this property by following three egocentric trajectories and imagining the corresponding path from BEV. Comparing against the ground-truth trajectories, we observe that movements in the egocentric view translate to consistent movements on the BEV map. The blue-to-red gradient denotes time increasing from $0.0$s to $20.0$s. At each point, the model predicts the next frame, i.e., it predicts $0.2$s into the future. The model possesses an "internal GPS" that allows it to locate and orient itself in its environment from visual cues alone.
  • Figure 5: Spawn anywhere in a known environment. A dramatic consequence of the emergent bi-directionality from cross-view training is the ability to spawn at arbitrary locations. Here, we take a bird's-eye-view trajectory as input (top row) and predict the action-conditioned next step from an egocentric viewpoint. Both models imagine detailed egocentric views along the trajectory, decoding all the necessary information from the location and orientation of the tiny ${\sim}17$px marker alone. This suggests a strong emergent understanding of the 3D environment. Notably, the sky varies stochastically when the input context is BEV, as expected: the training environment included multiple sky layouts, and BEV provides no information about which sky should appear. This suggests the model has learned spatial structure rather than memorizing input-output pairs. At each instant, the model predicts the next frame, that is, $0.2$s into the future.
  • ...and 2 more figures
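
The exposure percentages quoted in the Figure 2 and Figure 3 captions ($100\%$, $25\%$, $12.5\%$, $4.17\%$) and the $16$ view pairs in Figure 1 can be reproduced with a short calculation. The sketch below assumes that half of the training pairs are same-view and half are cross-view, split evenly within each group. That split is an inference from the quoted numbers, not something the captions state, and the function name `pair_exposure` is hypothetical. The view labels follow the captions (ego, BEV, OS, Front).

```python
from fractions import Fraction
from itertools import product


def pair_exposure(views, same_view_share=Fraction(1, 2)):
    """Return the training share of every (input_view, target_view) pair.

    ASSUMPTION: same-view pairs jointly receive `same_view_share` of the
    training data and cross-view pairs receive the rest, each group being
    split evenly. With a single view, the only pair gets everything.
    """
    pairs = list(product(views, repeat=2))
    same = [p for p in pairs if p[0] == p[1]]
    cross = [p for p in pairs if p[0] != p[1]]
    if not cross:
        return {p: Fraction(1, len(same)) for p in same}
    exposure = {p: same_view_share / len(same) for p in same}
    exposure.update({p: (1 - same_view_share) / len(cross) for p in cross})
    return exposure


single = pair_exposure(["ego"])
two = pair_exposure(["ego", "bev"])
four = pair_exposure(["ego", "bev", "os", "front"])

for name, table in [("Single-View", single), ("Two-View", two), ("Four-View", four)]:
    ego_ego = float(table[("ego", "ego")]) * 100
    print(f"{name}: {len(table)} pairs, ego->ego exposure {ego_ego:.2f}%")

# Cross-view exposure in the four-view setting, as a percentage.
print(f"Four-View bev->ego: {float(four[('bev', 'ego')]) * 100:.2f}%")
```

Under this assumption the Four-View model has $4$ same-view pairs at $1/8$ each and $12$ cross-view pairs at $1/24$ each, which matches the quoted $12.5\%$ and $4.17\%$. A uniform split over all $16$ pairs would instead give $6.25\%$ per pair, which does not match the captions.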