Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

Jiazhuo Li, Linjiang Cao, Qi Liu, Xi Xiong

TL;DR

This work proposes a kinematics-aware latent world model framework for autonomous driving. Its results suggest that grounding RSSM-based world models in vehicle kinematics offers a scalable, physically grounded paradigm for driving policy learning.

Abstract

Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.
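
To make the idea of feeding vehicle kinematics into the observation encoder more concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the abstract does not say which kinematic quantities are used or how they are fused, so the `FusedEncoder` class, the four-dimensional kinematic vector (for example speed, yaw rate, steering, acceleration), the layer sizes, and fusion by concatenation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_linear(in_dim, out_dim):
    """Random weights and zero bias for a dense layer (illustrative only)."""
    weight = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
    return weight, np.zeros(out_dim)


class FusedEncoder:
    """Hypothetical encoder that fuses image and kinematic features.

    The fusion-by-concatenation design is an assumption, not taken
    from the paper.
    """

    def __init__(self, image_shape=(3, 16, 16), kin_dim=4, embed_dim=32):
        img_dim = int(np.prod(image_shape))
        self.W_img, self.b_img = init_linear(img_dim, embed_dim)
        self.W_kin, self.b_kin = init_linear(kin_dim, embed_dim)
        self.W_out, self.b_out = init_linear(2 * embed_dim, embed_dim)

    def __call__(self, image, kinematics):
        # Flatten each image in the batch and embed it.
        flat = image.reshape(image.shape[0], -1)
        img_feat = np.tanh(flat @ self.W_img + self.b_img)
        # Embed the low-dimensional kinematic vector separately.
        kin_feat = np.tanh(kinematics @ self.W_kin + self.b_kin)
        # Concatenate both embeddings and project to one observation embedding.
        fused = np.concatenate([img_feat, kin_feat], axis=-1)
        return np.tanh(fused @ self.W_out + self.b_out)


encoder = FusedEncoder()
images = rng.normal(size=(2, 3, 16, 16))   # batch of 2 placeholder images
kinematics = rng.normal(size=(2, 4))       # batch of 2 placeholder kinematic vectors
embedding = encoder(images, kinematics)
```

In an RSSM-style model, an embedding like this would typically be the input to the posterior over the latent state, so the kinematic signal can influence every latent transition rather than only the decoded outputs.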

Paper Structure

This paper contains 15 sections, 17 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: World Model Learning. The network encodes multi-modal inputs into latent states and decodes multiple task-specific outputs including reconstruction, prediction, and driving-aware supervision signals.
  • Figure 2: Actor-Critic Learning. Imagined trajectories generated by the world model enable policy optimization in latent space. The actor predicts actions $\pi(a \mid \phi)$ while the critic estimates values $V(\phi)$ via $\lambda$-returns, allowing gradient-based updates without real environment interaction.
  • Figure 3: Comparison between our model and PPO. The solid lines represent the averaged return, while the shaded area indicates the variability around the mean.
  • Figure 4: Training curves of model variants in the ablation study. ImgOnly (green) uses image input alone; Img+Head (blue) adds lane and neighbor supervision heads; Img+Head+Phys (orange) further incorporates vehicle physics as input.
  • Figure 5: Comparison of imagination quality across model variants. Top row: ImgOnly generates physically inconsistent rollouts with blurred vehicle positions (left) and confused lane markings (right). Bottom row: Img+Head+Phys produces stable, physically plausible predictions with correct semantic preservation of surrounding vehicles and lane markings.
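
The $\lambda$-returns mentioned in the Figure 2 caption can be sketched as follows. This is the standard recursive form used in Dreamer-style actor-critic training, $R_t^\lambda = r_t + \gamma\left[(1-\lambda)\,V(\phi_{t+1}) + \lambda R_{t+1}^\lambda\right]$ with the final value estimate as the bootstrap; whether the paper uses exactly this variant is an assumption, and the function name, argument layout, and default `gamma` and `lam` values are illustrative.

```python
import numpy as np


def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns over one imagined trajectory.

    rewards: shape (H,), reward r_t for imagined steps t = 0..H-1.
    values:  shape (H + 1,), critic estimates V for states 0..H;
             the last entry bootstraps the return beyond the horizon.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = rewards.shape[0]
    if values.shape[0] != horizon + 1:
        raise ValueError("values must have one more entry than rewards")

    returns = np.empty(horizon)
    next_return = values[-1]  # bootstrap from the final value estimate
    # Work backwards so each step can reuse the return of the step after it.
    for t in reversed(range(horizon)):
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns
```

With `lam=0` this reduces to one-step temporal-difference targets, and with `lam=1` it becomes the discounted sum of imagined rewards plus the bootstrapped final value, so `lam` trades off bias from the critic against variance from long imagined rollouts.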