DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving
Kaiwen Cai, Xinze Liu, Xia Zhou, Hengtong Hu, Jie Xiang, Luyao Zhang, Xueyang Zhang, Kun Zhan, Yifei Zhan, Xianpeng Lang
TL;DR
The paper addresses the need for realistic, sequential LiDAR scene generation with fine-grained foreground and background control for autonomous driving. It introduces DriveLiDAR4D, a multimodal conditioning framework paired with LiDAR4DNet, an equirectangular spatial-temporal diffusion model featuring EST-Conv and EST-Trans to ensure temporal and spatial coherence. The approach leverages road sketches, scene captions, and object priors to guide generation, achieving state-of-the-art FRD, MMD, JSD, and FVD on nuScenes and KITTI-360, and improving downstream 3D object detection fidelity. These advances enable end-to-end, controllable, and realistic 4D LiDAR scene synthesis that can enhance simulation, evaluation, and safety analysis in autonomous driving systems.
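As a rough illustration of the equirectangular representation that LiDAR4DNet is described as operating on, the sketch below projects 3D points onto a range image indexed by elevation (rows) and azimuth (columns). The function name `project_equirectangular`, the default image size, and the vertical field-of-view bounds are assumptions chosen for illustration, not values taken from the paper.

```python
import math


def project_equirectangular(points, width=1024, height=32,
                            fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) points onto a height x width range image.

    Hypothetical helper: image size and field of view are illustrative
    assumptions. Each pixel stores the range of the nearest return,
    and 0.0 marks a pixel with no return.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        azimuth = math.atan2(y, x)        # horizontal angle in [-pi, pi]
        elevation = math.asin(z / r)      # vertical angle in [-pi/2, pi/2]
        if not (fov_down <= elevation <= fov_up):
            continue  # outside the assumed vertical field of view

        u = 0.5 * (1.0 - azimuth / math.pi)   # 0 at +pi, 1 at -pi
        v = (fov_up - elevation) / fov        # 0 at the top row
        col = min(int(u * width), width - 1)
        row = min(int(v * height), height - 1)

        # Keep the closest return when several points share a pixel.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image
```

A sequence of such images, one per frame, gives a spatial-temporal tensor on which convolutional and attention layers can operate; how the paper's EST-Conv and EST-Trans blocks do this is not specified in this section.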
Abstract
The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations: they lack sequential generation capabilities and cannot produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model, LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluate DriveLiDAR4D on the nuScenes and KITTI datasets. On nuScenes, it achieves an FRD score of 743.13 and an FVD score of 16.96, surpassing the current state-of-the-art (SOTA) method, UniScene, by 37.2% in FRD and 24.1% in FVD.
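The percentage gains quoted above can be read as relative reductions, assuming FRD and FVD are lower-is-better distances and that the improvement is measured against the baseline score. The sketch below shows that calculation; the function name `relative_improvement` and the example numbers are hypothetical placeholders, not scores reported for any method.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric versus a baseline.

    Assumes the convention (baseline - ours) / baseline; the source does
    not state which convention it uses.
    """
    if baseline <= 0.0:
        raise ValueError("baseline score must be positive")
    return (baseline - ours) / baseline


# Placeholder scores for illustration only, not results from the paper.
example = relative_improvement(baseline=100.0, ours=80.0)
print(f"{example:.1%}")
```

Under this convention a positive value means the new method scores lower (better) than the baseline, and a negative value means it scores worse.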
