MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding

Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, Yanming Guo

TL;DR

MoE3D introduces a mixture-of-experts framework to multi-modal 3D understanding by deploying specialized experts for different modalities and interactions, guided by a Top-1 gating mechanism. The MoE Superpoint Transformer (MEST) integrates an Information Aggregation Module to fuse superpoint, prompt, and segmentation cues, with a progressive pre-training regime and instruction-tuning of a large language model via LoRA. Across four benchmarks, MoE3D achieves state-of-the-art or competitive results on 3D referring segmentation and 3D QA tasks, while maintaining efficiency through sparse routing. The work demonstrates the value of expert specialization for adaptable, unified multi-modal reasoning in complex 3D scenes.
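For readers unfamiliar with LoRA, the low-rank update it applies to a frozen weight matrix can be written as below. The notation follows the original LoRA formulation, not this paper: $W_0$, $A$, $B$, $r$, and $\alpha$ are symbols introduced here for illustration, and the summary does not state which layers of the language model are adapted or which rank is used.

```latex
% LoRA: the pretrained weight W_0 stays frozen; only A and B are trained.
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Because only $A$ and $B$ receive gradients, the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.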

Abstract

Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, which struggles to handle the significant heterogeneity and complexity across modalities and leads to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. At its core, MoE3D deploys a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is proposed to further enhance the fusion performance. Top-1 gating selects a single expert from each expert group to process the features, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage semantic and 2D priors, thus equipping the network with a good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.

Paper Structure

This paper contains 23 sections, 13 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: (a) Schematic overview of our MoE3D. (b) Competitive performance of our MoE3D against contemporary algorithms on four popular 3D tasks.
  • Figure 2: Framework overview of our MoE3D. The colored point cloud is fed to the multi-modal feature extractor, which produces the visual features. The visual prompt and the sampled visual features are sent to the prompt aggregator, generating the enhanced prompt features. The visual features, together with the prompt features, are sent to the MoE superpoint transformer (MEST), generating visual tokens. The produced visual tokens and the text embedding are fed to the large language model, yielding the final output. For referring segmentation, the predicted masks are subsequently produced via the MEST module. The language model is partially fine-tuned using LoRA (Hu et al., 2021).
  • Figure 3: Overview of our MoE Superpoint Transformer. It consists of vanilla Transformer blocks and MoE blocks, where the latter are inserted in an interleaved way. Each MoE block contains four experts. In the feedforward operation, only one expert is activated by the gating network and used to process the input features, ensuring high efficiency.
  • Figure 4: Visual results on the referring segmentation task. (a) Predicted mask according to the textual referring expression. (b) The four experts, shown in different colors, exhibit distinct modality preferences. (c) Superpoint labels with limited boundary accuracy for training. (d) Raw point cloud of the corresponding 3D scene.
  • Figure 5: Qualitative visualization of expert specialization. (a) Raw point cloud. (b) Expert activation maps produced by our MoE3D, where each color corresponds to the dominant expert assigned to each point. (c) Superpoint labels used for training.
  • ...and 1 more figure
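The Top-1 routing described in the Figure 3 caption (four experts per MoE block, exactly one activated per input) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class name `Top1MoE`, the linear gate, the linear experts, the random initialization, and the choice to scale the expert output by its gate probability are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class Top1MoE:
    """Toy MoE layer (hypothetical): a linear gate picks one linear expert per token."""

    def __init__(self, dim, num_experts=4, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.gate = rand_matrix(num_experts, dim)  # one gate row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Return (index of the highest-scoring expert, its gate probability)."""
        probs = softmax(matvec(self.gate, token))
        idx = max(range(len(probs)), key=probs.__getitem__)
        return idx, probs[idx]

    def forward(self, tokens):
        """Run each token through its single selected expert only."""
        outputs, choices = [], []
        for token in tokens:
            idx, weight = self.route(token)
            expert_out = matvec(self.experts[idx], token)
            # Assumption: scale by the gate probability, a common Top-1 convention.
            outputs.append([weight * y for y in expert_out])
            choices.append(idx)
        return outputs, choices


layer = Top1MoE(dim=8, num_experts=4)
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
outputs, choices = layer.forward(tokens)
print(choices)  # the expert index chosen for each of the 5 tokens
```

Only the selected expert's weights are multiplied per token, so the per-token compute stays close to that of a single feedforward layer regardless of how many experts the block holds.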