Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, Ziwei Liu

TL;DR

This work tackles the data annotation bottleneck in LiDAR-based 3D scene understanding for autonomous driving by proposing LaserMix++, a semi-supervised framework that leverages spatial priors and cross-modal cues. It extends the original LaserMix with camera-to-LiDAR feature distillation and language-driven guidance, enabling effective learning from unlabeled data across multiple LiDAR representations. The approach demonstrates consistent performance gains over fully supervised baselines and prior SSL methods, especially in low-label regimes, and validates robustness across diverse driving datasets. The results highlight the practical potential of multi-modal, data-efficient learning for scalable 3D perception in real-world autonomous systems.

Abstract

Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.

Paper Structure

This paper contains 22 sections, 16 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Motivation. (a) We observe a strong spatial prior in LiDAR-acquired driving scenes, where objects and backgrounds around the ego-vehicle follow a patterned distribution across different (lower, middle, upper) laser beams. (b) The proposed laser beam mixing technique is agnostic to different LiDAR modalities and can be universally applied to existing LiDAR segmentation backbones. (c) Our approaches outperform state-of-the-art methods [Cylinder3D, PolarStream, SalsaNext, FIDNet, PolarNet] under low-data (10%, 20%, 50% labels) and high-data (full supervision) regimes on nuScenes [Panoptic-nuScenes].
  • Figure 2: Laser partition example. We group LiDAR points $(p^x_i, p^y_i, p^z_i)$ whose inclinations $\phi_i$ are within the same inclination range into the same area, as depicted in color regions.
  • Figure 3: Overview of our baseline consistency regularization framework. We feed the labeled scan $x_{l}$ into the Student network to compute the supervised loss $\mathcal{L}_{\text{sup}}$ (w/ ground truth $y_{l}$). The unlabeled scan $x_{u}$ and the generated pseudo-label $y_{u}$ are mixed with $(x_{l},y_{l})$ via LaserMix (Section \ref{sec:lasermix}) to produce the mixed data sample $(x_{\text{mix}},y_{\text{mix}})$, which is then fed into the Student network to compute the mixing loss $\mathcal{L}_{\text{mix}}$. Additionally, we encourage consistency between the Student network and the Teacher network by computing the mean teacher loss $\mathcal{L}_{\text{mt}}$ over their predictions, where the Teacher network's parameters are updated as the exponential moving average of the Student network's parameters. During inference, only the Teacher network is needed, so the computational cost is the same as that of a conventional LiDAR segmentation pipeline.
  • Figure 4: Data splitting strategies for data-efficient 3D scene understanding. The labeled (color) and unlabeled (gray-scale) LiDAR scans can be split via uniform (left), random (middle), and sequential (right) sampling strategies, respectively.
  • Figure 5: Qualitative assessments of state-of-the-art data-efficient 3D scene understanding models from the LiDAR bird's eye view and range view on the validation set of SemanticKITTI [SemanticKITTI]. To highlight the differences, the correct and incorrect predictions are painted in gray and red, respectively. Best viewed in color and zoomed in for additional details.
  • ...and 1 more figure
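
The captions of Figures 2 and 3 describe two mechanisms that are easy to illustrate in code: grouping points into areas by inclination and mixing two scans area by area, and updating the Teacher as an exponential moving average of the Student. The NumPy sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The inclination formula $\phi_i = \arctan\big(p^z_i / \sqrt{(p^x_i)^2 + (p^y_i)^2}\big)$, the equal-width areas, the even/odd interleaving rule, the field-of-view bounds, the momentum value, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def inclination(points):
    """Inclination of each (x, y, z) point, assumed to be arctan(z / sqrt(x^2 + y^2))."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.arctan2(z, np.sqrt(x ** 2 + y ** 2))


def partition_by_inclination(points, num_areas, phi_min, phi_max):
    """Return an area index in [0, num_areas) for every point.

    Areas are equal-width inclination ranges (an assumption of this sketch).
    """
    edges = np.linspace(phi_min, phi_max, num_areas + 1)
    # Only interior edges are passed, so out-of-range angles fall into the
    # first or last area instead of producing an invalid index.
    return np.digitize(inclination(points), edges[1:-1])


def laser_mix(points_a, labels_a, points_b, labels_b, num_areas=4,
              phi_min=np.deg2rad(-25.0), phi_max=np.deg2rad(3.0)):
    """Mix two scans: even-indexed areas come from scan A, odd-indexed from scan B.

    The field-of-view bounds and the interleaving rule are illustrative defaults.
    """
    area_a = partition_by_inclination(points_a, num_areas, phi_min, phi_max)
    area_b = partition_by_inclination(points_b, num_areas, phi_min, phi_max)
    keep_a = area_a % 2 == 0
    keep_b = area_b % 2 == 1
    mixed_points = np.concatenate([points_a[keep_a], points_b[keep_b]])
    mixed_labels = np.concatenate([labels_a[keep_a], labels_b[keep_b]])
    return mixed_points, mixed_labels


def ema_update(teacher, student, momentum=0.99):
    """Teacher parameters as an exponential moving average of Student parameters.

    Parameters are plain dicts of arrays here; a real model would use its own
    parameter containers.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan_l = rng.normal(size=(1000, 3))       # stand-in for a labeled scan x_l
    scan_u = rng.normal(size=(1200, 3))       # stand-in for an unlabeled scan x_u
    y_l = rng.integers(0, 16, size=1000)      # ground-truth labels y_l
    y_u = rng.integers(0, 16, size=1200)      # pseudo-labels y_u from the Teacher
    x_mix, y_mix = laser_mix(scan_l, y_l, scan_u, y_u)
    print(x_mix.shape, y_mix.shape)
```

In this sketch the mixed labels travel with their points, so a mixed sample $(x_{\text{mix}}, y_{\text{mix}})$ can be passed to the Student in the same way as a labeled scan.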