LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma

TL;DR

LiSTAR addresses the challenge of synthesizing realistic 4D LiDAR sequences by aligning generative modeling with the sensor’s geometry. The method fuses an HCS-based 4D VQ-VAE for discrete representations with the START module for ray-centric spatio-temporal modeling and a MaskSTART pipeline for controllable generation conditioned on 4D layouts. It achieves state-of-the-art results on nuScenes across reconstruction, prediction, and generation, including substantial gains in distributional realism and geometric fidelity. This work provides a robust foundation for realistic, controllable autonomous-driving simulation and scenario design.

Abstract

Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
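
To make the idea of a sensor-aligned grid concrete, the following is a minimal sketch of binning a Cartesian point into azimuth, elevation, and range cells around the sensor origin. It is plain spherical binning, not the paper's HCS representation, whose exact hybrid cylindrical-spherical definition is not given in this summary. The function name `spherical_voxel_index`, the bin counts, the field-of-view limits, and the maximum range are all hypothetical values chosen for illustration.

```python
import math


def spherical_voxel_index(x, y, z,
                          n_azimuth=8, n_elevation=4, n_range=4,
                          max_range=80.0,
                          elev_min=math.radians(-30.0),
                          elev_max=math.radians(10.0)):
    """Map a Cartesian point (sensor at the origin) to an
    (azimuth, elevation, range) bin index.

    All grid parameters are illustrative assumptions, not values from the paper.
    Returns None when the point falls outside the grid.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0 or r >= max_range:
        return None

    azimuth = math.atan2(y, x)      # angle in the horizontal plane
    elevation = math.asin(z / r)    # angle above the horizontal plane
    if not (elev_min <= elevation < elev_max):
        return None

    # Uniform bins in each angular/radial coordinate.
    a = int((azimuth + math.pi) / (2.0 * math.pi) * n_azimuth) % n_azimuth
    e = int((elevation - elev_min) / (elev_max - elev_min) * n_elevation)
    k = int(r / max_range * n_range)
    return a, e, k


# Two points on the same ray from the sensor, at different distances.
near = spherical_voxel_index(10.0, 5.0, 1.0)
far = spherical_voxel_index(50.0, 25.0, 5.0)
print(near, far)
```

In this sketch, points that lie on the same ray share their angular indices and differ only in the range index, which is the property a Cartesian cube grid does not provide.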

Paper Structure

This paper contains 33 sections, 10 equations, 8 figures, 5 tables, and 6 algorithms.

Figures (8)

  • Figure 1: Cartesian vs. HCS coordinates for LiDAR scene representation. Cartesian coordinates partition space into uniform, axis‑aligned cubes, ignoring the native ray geometry of LiDAR. HCS coordinates divide space into angular–radial cells centered at the sensor origin, aligning with LiDAR’s ray-based sampling pattern and preserving range-dependent resolution.
  • Figure 2: Illustration of the LiSTAR framework for 4D LiDAR sequence reconstruction and generation. The framework begins by voxelizing LiDAR point clouds into a spherical coordinate representation, which is downsampled and processed by multiple START modules in the encoder to extract semantically rich latent tokens. The decoder reconstructs detailed 4D sequences by upsampling tokens with additional START modules. The MaskSTART component facilitates controllable and diverse generation by predicting masked tokens using a bidirectional transformer, conditioned on 4D point cloud-aligned voxel layouts. This design captures spatiotemporal dependencies while preserving fine-grained geometric details.
  • Figure 3: An illustration of our START module. It processes a 4D feature map of shape $[B, D, H, W, C]$, where $D$ is the temporal dimension. It is composed of two key components: (1) a CSTA block that operates on windowed features to efficiently model temporal dependencies, and (2) an SRA block that processes features reshaped to $[B*D, H, W, C]$ to capture spatial correlations along the ray dimension.
  • Figure 4: Qualitative comparison of point cloud reconstruction. The visualization overlays predictions with the ground truth: magenta (correct intersection), green (missed ground truth), and blue (artifacts). Our method consistently yields more complete reconstructions (denser magenta) with significantly fewer artifacts (less blue), demonstrating superior accuracy.
  • Figure 5: Qualitative results for prediction and generation. We compare our method and OpenDWM against the ground truth for future horizons up to 2s. Our method consistently produces sharper and more accurate results than the baseline for both static background and dynamic objects (highlighted). The baseline's predictions and generations degrade significantly over time, losing structural detail.
  • ...and 3 more figures
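
The tensor shapes in the Figure 3 caption can be illustrated with a small sketch. It uses nested Python lists in place of real tensors and shows only the bookkeeping: folding the temporal dimension $D$ into the batch to obtain $[B*D, H, W, C]$, and gathering the length-$D$ feature sequence at a single $(h, w)$ cell. The helper names (`make_feature_map`, `fold_time_into_batch`, `ray_time_series`) are hypothetical, and treating one $(h, w)$ cell as "one ray" is an assumption made for illustration; the caption does not specify the attention layout or windowing of the CSTA and SRA blocks.

```python
def make_feature_map(B, D, H, W, C):
    """Build a toy [B][D][H][W][C] nested list whose entries record their own index."""
    return [[[[[(b, d, h, w, c) for c in range(C)]
               for w in range(W)]
              for h in range(H)]
             for d in range(D)]
            for b in range(B)]


def fold_time_into_batch(x):
    """[B, D, H, W, C] -> [B*D, H, W, C]: frame d of sample b lands at index b*D + d."""
    return [frame for sample in x for frame in sample]


def ray_time_series(x, b, h, w):
    """Collect the length-D sequence of C-dim features at one (h, w) cell of sample b."""
    return [frame[h][w] for frame in x[b]]


B, D, H, W, C = 2, 3, 4, 5, 1
x = make_feature_map(B, D, H, W, C)

# Spatial processing view: every frame becomes an independent batch element.
folded = fold_time_into_batch(x)

# Temporal view (assumed): how the feature at one cell evolves across the D frames.
series = ray_time_series(x, b=1, h=2, w=3)

print(len(folded), len(folded[0]), len(folded[0][0]), len(folded[0][0][0]))
print(series)
```

Folding time into the batch lets a purely spatial operator run on all frames at once, while the per-cell sequence is the kind of input a temporal operator would consume.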