
Text2LiDAR: Text-guided LiDAR Point Cloud Generation via Equirectangular Transformer

Yang Wu, Kaihua Zhang, Jianjun Qian, Jin Xie, Jian Yang

TL;DR

This work introduces Text2LiDAR, a diffusion-based framework that generates text-guided LiDAR point clouds by converting scans into equirectangular representations. It introduces an equirectangular transformer with EA/REA attention, a global-to-focused control-signal embedding injector (CEI), and a frequency modulator (FM) to preserve high-frequency detail, together enabling accurate, text-controllable generation. To support the field, the authors assemble nuLiDARtext with 34,149 text-LiDAR pairs across 850 nuScenes scenes, enabling reliable text conditioning. Evaluations on KITTI-360 and nuScenes demonstrate superior performance in uncontrolled generation, densification, and, notably, text-controlled generation, underscoring practical potential for data augmentation and scenario customization in autonomous systems.

Abstract

The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. High-quality and controllable LiDAR data generation is urgently needed. Controlling generation with text is common practice, but there is little research on it in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture that uses the designed equirectangular attention to capture LiDAR features in a manner consistent with the data's characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through a global-to-focused attention mechanism. Additionally, we devise a frequency modulator to help the model recover high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext, which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on the KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.
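
The equirectangular representation mentioned above can be illustrated with a generic spherical projection of a LiDAR scan onto a 2D range image. The sketch below is not taken from the paper: the function name, the image resolution, and the vertical field-of-view values are all assumptions chosen for illustration, and the authors' actual preprocessing may differ.

```python
import math


def project_to_equirectangular(points, height=64, width=1024,
                               fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points onto a height x width range image.

    Hypothetical helper: resolution and field of view are assumed defaults.
    Rows index elevation (top row = upper FOV limit), columns index azimuth.
    Empty cells stay at 0.0; if several points share a cell, the closest wins.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        depth = math.sqrt(x * x + y * y + z * z)
        if depth == 0.0:
            continue  # the sensor origin carries no direction
        azimuth = math.atan2(y, x)        # horizontal angle in [-pi, pi]
        elevation = math.asin(z / depth)  # vertical angle in [-pi/2, pi/2]
        if not (fov_down <= elevation <= fov_up):
            continue  # outside the assumed vertical field of view

        u = 0.5 * (1.0 - azimuth / math.pi)  # normalized column in [0, 1]
        v = (fov_up - elevation) / fov       # normalized row in [0, 1]
        col = min(width - 1, int(u * width))
        row = min(height - 1, int(v * height))

        # Keep the nearest return when two points land in the same cell.
        if image[row][col] == 0.0 or depth < image[row][col]:
            image[row][col] = depth
    return image
```

Each pixel of the resulting image stores the range of one laser return, so the scan becomes a dense 2D grid that image-style diffusion and attention layers can operate on. The horizontal axis wraps around 360 degrees, which is presumably the data characteristic the equirectangular attention is designed around.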

Paper Structure (14 sections, 6 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Schematic comparison of our Text2LiDAR and the existing diffusion-based generation frameworks (zyrianov2022learning; nakashima2023lidar) without text guidance.
  • Figure 2: The architecture of the designed Text2LiDAR, where the designed equirectangular transformer is composed of stacked EA (encoding stage) and REA (decoding stage). The feature sequence will start interacting with the control signal at the 4th layer and be fed into a 4-layer decoder composed of REA. During decoding, the feature sequence continuously fuses the control signal through CEI. Finally, after frequency modulation, we can get the predicted noise.
  • Figure 3: The architecture of the frequency modulator.
  • Figure 4: The number of occurrences of text in 850 scenes.
  • Figure 5: Comparison with LiDARGen and R2DM on uncontrolled generation.
  • ...and 2 more figures