ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving

Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, Yunhong Wang

TL;DR

ResWorld introduces a Temporal Residual World Model (TR-World) that isolates dynamic objects by computing temporal residuals of BEV scene representations, thereby reducing redundant static-object modeling. It further couples this with Future-Guided Trajectory Refinement (FGTR), which uses predicted future BEV features to refine prior trajectories and provide sparse supervision to the latent world model, mitigating collapse. Empirical results on nuScenes and NAVSIM show state-of-the-art planning performance and robust trajectory refinement, with ablations confirming the complementary benefits of TR-World and FGTR. The approach offers a scalable, end-to-end framework that emphasizes dynamic-object forecasting and trajectory-world interaction, advancing planning accuracy in autonomous driving.

Abstract

The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module not only utilizes future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
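
The residual idea in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the tensor layout `(channels, height, width)`, the assumption that both BEV maps are already aligned to the same ego frame, the additive combination step, and the hand-written `toy_predictor` (which stands in for the learned TR-World network) are all assumptions made for illustration.

```python
import numpy as np

def temporal_residual(bev_prev, bev_curr):
    """Element-wise difference of two BEV feature maps (assumed ego-aligned)."""
    return bev_curr - bev_prev

def toy_predictor(residual):
    """Hypothetical stand-in for TR-World.

    Assumes every dynamic object keeps moving one cell per step along the
    width axis, so the residual is simply shifted by one cell. The real
    model is a learned network and is not described in this section.
    """
    return np.roll(residual, shift=1, axis=-1)

def compose_future_bev(bev_curr, future_residual):
    """Combine static content in the current BEV with the predicted change.

    Plain addition is an assumption; the abstract only says the two are combined.
    """
    return bev_curr + future_residual

# Toy scene: a static background plus one object moving along the width axis.
rng = np.random.default_rng(0)
static = rng.normal(size=(4, 8, 8))      # (channels, height, width)
bev_prev = static.copy()
bev_curr = static.copy()
bev_prev[:, 2, 2] += 1.0                 # object at cell (2, 2) at time t-1
bev_curr[:, 2, 3] += 1.0                 # object at cell (2, 3) at time t

residual = temporal_residual(bev_prev, bev_curr)

# Static cells cancel exactly; only cells touched by the moving object remain.
dynamic_cells = np.argwhere(np.abs(residual).sum(axis=0) > 1e-9)

future_bev = compose_future_bev(bev_curr, toy_predictor(residual))
```

In this toy case the residual is nonzero only at the two cells the object occupied, so no detector or tracker is needed to locate it, and the composed future map is the static background with the object moved one cell further along.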

Paper Structure (17 sections, 10 equations, 5 figures, 5 tables)

This paper contains 17 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison Between Normal World Model Framework and ResWorld Framework. Unlike normal world models, which model the entire scene and optimize trajectories implicitly, ResWorld uses the temporal residuals of the scene representations to represent dynamic objects for precise modeling. Meanwhile, the prior trajectories are corrected through explicit interaction with the predicted future BEV features.
  • Figure 2: Overall Framework of ResWorld. Multi-view images at different timestamps are converted into BEV features, which are used to predict prior trajectories. In parallel, the BEV features are used to calculate temporal residuals, which are then processed by the Temporal Residual World Model to predict the future distribution of dynamic objects. The Future-Guided Trajectory Refinement module further utilizes the predicted future BEV features to refine the planning results.
  • Figure 3: Structure of Temporal Residual World Model
  • Figure 4: Effect of the Future-Guided Trajectory Refinement Module on alleviating world model collapse. The first row presents the future BEV features supervised using real future data, while those in the second row are predicted by the world model equipped with the FGTR module. The BEV features in the second row show more diversity in spatial distribution.
  • Figure 5: Visualization of Planning Results. The object bounding boxes and lane lines on the BEV plane are rendered using the annotations. The green box denotes the ego vehicle. The areas enclosed by dashed circles indicate where collisions will occur.
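
The refinement step described in the Figure 2 caption, where prior trajectories interact with predicted future BEV features, might look roughly like the sketch below. It is one possible reading, not the paper's FGTR module: the grid extent `bev_range_m`, the nearest-cell lookup, the linear offset head, the random `weight` matrix, and all function names are hypothetical.

```python
import numpy as np

def to_grid_index(coord_m, bev_range_m, grid_size):
    """Convert ego-frame metres in [-bev_range_m, bev_range_m) to clipped cell indices."""
    frac = (coord_m + bev_range_m) / (2.0 * bev_range_m)
    idx = np.floor(frac * grid_size).astype(int)
    return np.clip(idx, 0, grid_size - 1)

def sample_future_features(future_bev, waypoints, bev_range_m=32.0):
    """Read one feature vector per future step at the prior waypoint location.

    future_bev: (T, C, H, W) predicted BEV features, one map per future step.
    waypoints:  (T, 2) prior trajectory as (x, y) in metres, ego frame.
    Returns:    (T, C) sampled features.
    """
    T, C, H, W = future_bev.shape
    cols = to_grid_index(waypoints[:, 0], bev_range_m, W)
    rows = to_grid_index(waypoints[:, 1], bev_range_m, H)
    # Advanced indices split by a slice: the indexed axis comes first -> (T, C).
    return future_bev[np.arange(T), :, rows, cols]

def refine_trajectory(prior, sampled, weight):
    """Add a per-waypoint (dx, dy) offset from a linear head (hypothetical)."""
    return prior + sampled @ weight

rng = np.random.default_rng(0)
T, C, H, W = 6, 8, 64, 64
future_bev = rng.normal(size=(T, C, H, W))   # stand-in for world-model output
prior = np.stack([np.linspace(1.0, 12.0, T), np.zeros(T)], axis=1)  # straight ahead
weight = 0.01 * rng.normal(size=(C, 2))      # stand-in for learned parameters

sampled = sample_future_features(future_bev, prior)
refined = refine_trajectory(prior, sampled, weight)
```

Because only one cell per future step is read, a trajectory loss on `refined` would touch the future BEV maps at just those `T` locations. That is one way to interpret the "sparse spatial-temporal supervision" mentioned in the abstract, though the source does not spell out the mechanism.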