LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou, Xiaosong Jia, Fanrui Zhang, Junjie Li, Juyong Zhang, Yukang Feng, Jianwen Sun, Songbur Wong, Junqi You, Junchi Yan

TL;DR

LaGen addresses the need for long-horizon, interactive LiDAR scene generation from a single frame by introducing a frame-by-frame autoregressive framework based on a Latent Diffusion Model. The method combines a range-image representation, a multi-condition diffusion generator, and two key modules (Scene Decoupling Estimation and Noise Modulation) to achieve strong spatiotemporal coherence. It supports interactive edits at the object level and outperforms state-of-the-art LiDAR generation and prediction models on a dedicated long-horizon evaluation protocol built on nuScenes. The work enables improved closed-loop simulation and world modeling for autonomous driving by integrating per-step decisions into future LiDAR predictions.

Abstract

Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data support only single-frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen takes a single-frame LiDAR input as a starting point and uses bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results demonstrate that LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.
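
The frame-by-frame loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`modulate`, `rollout`, `toy_generator`), the additive Gaussian form of the noise, and the fixed `noise_scale` are all assumptions made for illustration, and the diffusion generator is replaced by a caller-supplied stand-in.

```python
import numpy as np


def modulate(prev_latent, noise_scale, rng):
    """Perturb the previous-frame latent with Gaussian noise.

    Assumption: additive noise with a fixed scale; the paper's actual
    noise modulation scheme may differ.
    """
    return prev_latent + noise_scale * rng.standard_normal(prev_latent.shape)


def rollout(first_latent, boxes_per_step, generate_frame, noise_scale=0.1, seed=0):
    """Generate one frame per entry of boxes_per_step, feeding each output back in."""
    rng = np.random.default_rng(seed)
    frames = []
    prev = first_latent
    for boxes in boxes_per_step:
        # Condition on a noised copy of the previous frame's latent.
        cond = modulate(prev, noise_scale, rng)
        # Hypothetical generator call: previous-frame condition + current boxes.
        prev = generate_frame(cond, boxes)
        frames.append(prev)
    return frames


def toy_generator(cond_latent, boxes):
    """Stand-in for the diffusion generator (not a real model)."""
    return cond_latent + boxes.mean()


if __name__ == "__main__":
    start = np.zeros((2, 4, 4))
    boxes = [np.full((1, 7), 1.0), np.full((1, 7), 2.0)]
    out = rollout(start, boxes, toy_generator, noise_scale=0.0)
    print(len(out), out[-1].shape)
```

Because the bounding boxes are supplied one step at a time, a caller can change them between steps, which is the sense in which a loop of this shape is interactive.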

Paper Structure

This paper contains 29 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: This work introduces LaGen, a novel 4D LiDAR scene generation framework. (a) LaGen autoregressively generates long-horizon autonomous driving scenarios from a single input frame, producing high-fidelity LiDAR data consistent with the real world. (b) Visualization of object-level layouts in a generated scene from different viewpoints.
  • Figure 2: Overview of the LaGen framework. It first obtains estimates of the current frame's foreground and background point clouds via the Scene Decoupling Estimation (SDE) module. Subsequently, all three-dimensional information is projected to the two-dimensional range image space via spherical mapping. These range-view representations are then processed by a pretrained VAE and a noise modulation (NM) module, before being passed to the U-Net to generate the LiDAR scene for the current frame. Afterward, the generated scene is fed back as the model's input for the next frame, thereby completing the autoregressive process.
  • Figure 3: The schematic diagram of the LiDAR generator architecture. It mainly comprises a VAE and a U-Net. Additionally, it includes the following modules: a) Spherical projection module, which provides a compact representation of 3D data; b) SDE module, designed to enhance the model's ability to learn detailed features; c) NM module, which mitigates error accumulation during long-horizon generation by introducing noise modulation.
  • Figure 4: The schematic diagram of the scene decoupling estimation module. This module performs a coarse estimation of the current foreground and background based on the previous frame’s scene and the object-level bounding box information of the current frame.
  • Figure 5: The schematic diagram of the noise modulation module. This module adds noise to the latent features containing previous-frame information at each autoregressive step.
  • ...and 2 more figures
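
The spherical mapping mentioned in the captions for Figures 2 and 3 turns a 3D point cloud into a 2D range image. The sketch below shows the standard form of that projection in NumPy. The function name `spherical_projection`, the 32 x 1024 resolution, and the vertical field of view (+10° to -30°) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def spherical_projection(points, height=32, width=1024,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (N, 3) points to a (height, width) range image.

    Resolution and field of view are assumed defaults, not the paper's settings.
    """
    xyz = np.asarray(points, dtype=np.float64)[:, :3]
    depth = np.linalg.norm(xyz, axis=1)
    keep = depth > 1e-6                      # drop points at the sensor origin
    xyz, depth = xyz[keep], depth[keep]

    yaw = np.arctan2(xyz[:, 1], xyz[:, 0])   # azimuth in [-pi, pi]
    pitch = np.arcsin(xyz[:, 2] / depth)     # elevation

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    # Normalise angles to pixel coordinates: columns from azimuth, rows from elevation.
    u = 0.5 * (1.0 - yaw / np.pi) * width
    v = (1.0 - (pitch - fov_down) / fov) * height
    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Write far points first so the nearest point in each pixel is kept.
    order = np.argsort(-depth)
    image = np.zeros((height, width), dtype=np.float32)
    image[rows[order], cols[order]] = depth[order]
    return image


if __name__ == "__main__":
    cloud = np.random.default_rng(0).uniform(-20.0, 20.0, size=(1000, 3))
    print(spherical_projection(cloud).shape)
```

Pixels that receive no point stay at zero. Points outside the vertical field of view are clipped to the top or bottom row here, whereas a real pipeline might discard them instead.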