Cross-View World Models
Rishabh Sharma, Gijs Hogervorst, Wayne E. Mackey, David J. Heeger, Stefano Martiniani
TL;DR
A cross-view prediction objective lets world models imagine outcomes from multiple viewpoints, not just the egocentric one, by pushing the model toward view-invariant representations of 3D structure. Trained on synchronized multi-view gameplay data, XVWM produces parallel imagination streams across viewpoints and demonstrates spatial grounding comparable to cognitive maps, as well as positive transfer to same-view prediction. The approach yields novel capabilities such as sub-marker localization, trajectory consistency over long rollouts, and detailed egocentric predictions from a bird's-eye-view (BEV) marker, highlighting its potential for planning and multi-agent perspective-taking. Cross-view consistency thus serves as a geometric regularizer for self-supervised world modeling and offers a foundation for robust, perspective-aware planning in complex environments.
Abstract
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, the model can predict across viewpoints only by learning view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.
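To make the cross-view prediction objective concrete, the following is a minimal toy sketch, not the XVWM architecture or training code. It assumes that each view is already encoded as a per-frame feature vector, that views are time-aligned, and that a linear head per target view with a mean-squared-error loss stands in for the real predictor. All names (`sample_pair`, `init_params`, `predict`, `cross_view_loss`, the view labels `"ego"` and `"bev"`) are hypothetical.

```python
import numpy as np

def sample_pair(clips, actions, context_len, rng):
    """Draw one training example for a cross-view objective.

    clips:   dict mapping view name -> array of shape (T, D), time-aligned across views
    actions: array of shape (T, A); actions[t - 1] is taken between frame t - 1 and t
    """
    views = sorted(clips)
    src = views[rng.integers(len(views))]
    tgt = views[rng.integers(len(views))]       # may equal src (same-view prediction)
    T = actions.shape[0]
    t = int(rng.integers(context_len, T))       # index of the frame to predict
    context = clips[src][t - context_len:t]     # past frames from the source view
    action = actions[t - 1]
    target = clips[tgt][t]                      # future frame from the target view
    return src, tgt, context, action, target

def init_params(views, context_len, feat_dim, act_dim, rng):
    """One linear head per target view (an illustrative assumption)."""
    in_dim = context_len * feat_dim + act_dim
    return {
        v: {"W": 0.01 * rng.standard_normal((feat_dim, in_dim)),
            "b": np.zeros(feat_dim)}
        for v in views
    }

def predict(params, tgt, context, action):
    """Predict the target-view features from source-view context and an action."""
    x = np.concatenate([context.ravel(), action])
    head = params[tgt]
    return head["W"] @ x + head["b"]

def cross_view_loss(pred, target):
    """Mean squared error between predicted and actual target-view features."""
    return float(np.mean((pred - target) ** 2))

# Toy data: two synchronized views with random features (placeholders, not real data).
rng = np.random.default_rng(0)
T, D, A, K = 12, 4, 2, 3
clips = {"ego": rng.standard_normal((T, D)), "bev": rng.standard_normal((T, D))}
actions = rng.standard_normal((T, A))
params = init_params(clips.keys(), K, D, A, rng)

src, tgt, context, action, target = sample_pair(clips, actions, K, rng)
loss = cross_view_loss(predict(params, tgt, context, action), target)
```

Because the source and target views are sampled independently, the same loss covers both same-view and cross-view prediction; in this sketch only the choice of output head depends on the target view.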
