RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space

Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan

TL;DR

RAYNOVA is a geometry-agnostic multi-view world model for driving scenarios that employs a dual-causal autoregressive framework and leverages global attention for unified 4D spatio-temporal reasoning.

Abstract

World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multi-view world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Unlike existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, and it generalizes to novel views and camera configurations without an explicit 3D scene representation. Our code will be released at https://raynova-ai.github.io/.
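To make the ray-space idea concrete, the following is a minimal sketch of per-pixel Plücker ray coordinates, the standard 6D line representation (unit direction d and moment o × d) that the positional encoding is based on. It is not RAYNOVA's implementation: the function name `plucker_rays`, the pinhole intrinsics `K`, and the 4×4 `cam_to_world` pose convention are assumptions made for illustration, and how the model turns these coordinates into a relative encoding across views, frames, and scales is not specified in this abstract.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of Plucker coordinates (d, o x d), one ray per pixel centre.

    K            : (3, 3) pinhole intrinsics (assumed, hypothetical input format)
    cam_to_world : (4, 4) camera-to-world pose (assumed convention)
    """
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = dirs_cam @ R.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)  # unit direction

    # The moment o x d encodes where the line sits in space, independent of
    # which point on the ray is chosen as its origin.
    m = np.cross(o, d)
    return np.concatenate([d, m], axis=-1)
```

Because each ray is described only by its direction and moment, two cameras with different intrinsics or mounting positions produce coordinates in the same space, which is one plausible reason a ray-level encoding can transfer across camera setups.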

Paper Structure (29 sections, 8 equations, 9 figures, 9 tables, 1 algorithm)

Figures (9)

  • Figure 1: Demonstrations of RayNova as Versatile World Foundation Model.
  • Figure 2: Overview of RayNova Framework. RayNova is composed of dual-causal (scale and time) blocks. The local scale attention and local cross attention work on each image independently, while the global causal attention works across multi-view and multi-frame images, enhanced with a unified ray-level relative position embedding for better spatio-temporal consistency.
  • Figure 3: Dual-Causality for Multi-View Video Generation. Green arrows represent the causal dependency, while the darkness indicates the topological order of autoregression (from light to dark).
  • Figure 4: Ablation Study on Scale Causality. Conditioning on all scales in history hurts the modeling of dynamics, while conditioning only on the same scale is insufficient for temporal coherence.
  • Figure 5: Ablation Study on Model Size. A larger model brings significantly better visual quality.
  • ...and 4 more figures
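
The two history-conditioning variants named in the Figure 4 caption can be sketched as block-level attention masks over (frame, scale) pairs. This is an illustrative sketch only: the function `dual_causal_mask`, the frame-major block ordering, and the assumption that a scale attends to all coarser scales of its own frame are hypothetical choices, and the configuration RAYNOVA actually adopts is not stated in the captions above.

```python
import numpy as np

def dual_causal_mask(num_frames, num_scales, history="all"):
    """Boolean (N, N) mask over (frame, scale) blocks; True means the query block may attend.

    history="all"  : attend to every scale of earlier frames
    history="same" : attend only to the same scale of earlier frames
    Both are the ablation variants named in the Figure 4 caption, not the final design.
    """
    if history not in ("all", "same"):
        raise ValueError("history must be 'all' or 'same'")

    # Frame-major ordering: all scales of frame 0 (coarse to fine), then frame 1, ...
    blocks = [(t, s) for t in range(num_frames) for s in range(num_scales)]
    n = len(blocks)
    mask = np.zeros((n, n), dtype=bool)

    for q, (tq, sq) in enumerate(blocks):
        for k, (tk, sk) in enumerate(blocks):
            if tk == tq:
                # Scale causality inside a frame: only coarser or equal scales (assumed).
                mask[q, k] = sk <= sq
            elif tk < tq:
                # Temporal causality: earlier frames only, filtered by the history policy.
                mask[q, k] = (history == "all") or (sk == sq)
    return mask
```

With `history="all"` the mask reduces to a plain lower-triangular mask over the frame-major order, while `history="same"` keeps only same-scale links to the past; per the caption, the first hurts the modeling of dynamics and the second is insufficient for temporal coherence.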