LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
Xiang Xu, Lingdong Kong, Hui Shuai, Liang Pan, Ziwei Liu, Qingshan Liu
TL;DR
LiMoE is a mixture-of-experts (MoE) framework for LiDAR representation learning that combines three representations (range images, sparse voxels, and raw points) in a three-stage pipeline: Image-to-LiDAR pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS). An MoE gate dynamically selects and combines attributes from the different representations, which improves 3D scene understanding and robustness. The method is evaluated on 11 large-scale datasets and on downstream tasks including semantic segmentation and 3D object detection, where it shows consistent gains over single-representation baselines. Ablations and qualitative analyses support the contributions of CML and SMS. The work also provides implementation details and public resources.
Abstract
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on the sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across eleven large-scale LiDAR datasets demonstrate the effectiveness and superiority of our approach. The code has been made publicly accessible.
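To make the gating idea in stages ii) and iii) concrete, the following is a minimal sketch of how a softmax gate could weight per-point features (or semantic logits) from the three representations. It is an illustration under stated assumptions, not the authors' implementation: the function names `gate_weights` and `weighted_sum`, the linear gate, and the toy values are all hypothetical, and the contrastive distillation into a unified 3D network is not shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(features, gate_matrix):
    """Hypothetical linear gate: one score per expert from the concatenated
    per-point features, normalized into mixing weights with softmax."""
    concat = [x for f in features for x in f]
    scores = [sum(w * x for w, x in zip(row, concat)) for row in gate_matrix]
    return softmax(scores)

def weighted_sum(vectors, weights):
    """Combine same-length vectors (features or logits) with the gate weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Toy per-point features from the three representations (assumed values).
range_feat = [1.0, 0.0]
voxel_feat = [0.0, 1.0]
point_feat = [1.0, 1.0]
features = [range_feat, voxel_feat, point_feat]

# An all-zero gate scores every expert equally, so the mix is a plain average.
uniform_gate = [[0.0] * 6 for _ in range(3)]
uniform_w = gate_weights(features, uniform_gate)
uniform_mix = weighted_sum(features, uniform_w)

# A gate that strongly favors the range-image expert for this point.
range_gate = [[10.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0] * 6, [0.0] * 6]
range_w = gate_weights(features, range_gate)
range_mix = weighted_sum(features, range_w)
```

The same `weighted_sum` helper applies to per-class logits in place of features, which mirrors the way SMS is described as combining semantic logits from multiple representations. In a trained model the gate parameters would be learned rather than set by hand.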
