LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes

Xiang Xu, Lingdong Kong, Hui Shuai, Liang Pan, Ziwei Liu, Qingshan Liu

TL;DR

LiMoE is a mixture-of-experts (MoE) framework for LiDAR representation learning that combines range images, sparse voxels, and raw points in a three-stage pipeline: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS). An MoE gate dynamically selects and combines attributes from the three representations, improving 3D scene understanding and robustness. The authors report consistent gains over single-representation baselines across 11 large-scale datasets and downstream tasks including semantic segmentation and 3D object detection, with ablations and qualitative analyses supporting the contributions of CML and SMS. The paper provides extensive implementation details, and the code is publicly available.

Abstract

LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across eleven large-scale LiDAR datasets demonstrate the effectiveness and superiority of our approach. The code has been made publicly accessible.
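To make the gating idea in stages ii) and iii) concrete, here is a minimal sketch of a softmax gate that weights three per-point expert outputs and sums them. It is an illustration under stated assumptions, not the authors' implementation: the linear gate, the `gate_weights` values, the toy feature and logit values, and the helper names `gate_scores` and `mix` are all hypothetical, and the abstract does not say whether SMS reuses the same gate for logits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scores(features, gate_weights):
    """Hypothetical linear gate: one score per expert from the concatenated features."""
    concat = [x for feat in features for x in feat]
    return [sum(w * x for w, x in zip(row, concat)) for row in gate_weights]


def mix(expert_outputs, weights):
    """Weighted sum of per-expert vectors (features or semantic logits)."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


# Toy per-point features from the range, voxel, and point experts (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumed gate parameters: 3 experts x 6 concatenated inputs (illustrative only).
gate_weights = [
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]

weights = softmax(gate_scores(features, gate_weights))
mixed_feature = mix(features, weights)

# The same weighted sum applied to toy per-expert semantic logits (3 classes).
logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
fused_logits = mix(logits, weights)
predicted_class = max(range(len(fused_logits)), key=fused_logits.__getitem__)
```

Because the gate weights are a softmax output, they are positive and sum to one, so the mixed vector is a convex combination of the expert outputs. In a real model the gate would be learned per point and the mixing would run on batched tensors rather than Python lists.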

Paper Structure (32 sections, 11 equations, 17 figures, 10 tables, 2 algorithms)

This paper contains 32 sections, 11 equations, 17 figures, 10 tables, 2 algorithms.

Figures (17)

  • Figure 1: Illustration of the proposed mixture of LiDAR representation learning (LiMoE) design. We observe unique patterns of each LiDAR representation (range, voxel, and point) in image-to-LiDAR data pretraining. Our framework aims to integrate distinct attributes from different LiDAR representations into a unified feature space, enabling enhanced 3D scene understanding.
  • Figure 2: Overview of the LiMoE framework. Our design consists of three stages: (#1) The image-to-LiDAR pretraining transfers knowledge from images to various LiDAR representations (cf. \ref{sec:stage1}); (#2) The contrastive mixture learning (CML) integrates the MoE framework to mix data attributes into a unified representation for pretraining (cf. \ref{sec:stage2}); and (#3) The semantic mixture supervision (SMS) fuses semantic logits from multiple representations to further enhance performance across different downstream tasks (cf. \ref{sec:stage3}).
  • Figure 3: Visual interpretations of the expert activation paths in CML. The experts are #1 range view, #2 voxel, and #3 point, respectively.
  • Figure 4: Cosine similarity between learned features of a query point (denoted as the red dot) and: (1) the features of the image of the same scene (the first row); and (2) the features of the LiDAR points projected onto the image (the second row). Best viewed in color.
  • Figure 5: Ablation study on distributions of expert loadings in CML. The distributions are based on (a) LiDAR beam numbers and (b) distances. The three experts are #1 range view, #2 voxel, and #3 point, respectively.
  • ...and 12 more figures
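
The similarity maps described in the Figure 4 caption compare one query feature against many candidate features. A minimal sketch of that computation follows; the function names and toy vectors are hypothetical, and returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero, not something the paper specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for degenerate (all-zero) features
    return dot / (norm_a * norm_b)


def similarity_map(query, candidates):
    """Similarity of one query feature to each candidate (pixel or point) feature."""
    return [cosine_similarity(query, c) for c in candidates]


# Toy features: the query, then an identical, an orthogonal, and an opposite vector.
query = [1.0, 2.0, 0.0]
candidates = [[1.0, 2.0, 0.0], [0.0, 0.0, 3.0], [-1.0, -2.0, 0.0]]
sims = similarity_map(query, candidates)
```

Plotting such a list as a heat map over pixel or point locations would give a visualization of the kind the caption describes, with values near 1 marking features most similar to the query.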