Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts

Yue Zhang, Yingzhao Jian, Hehe Fan, Yi Yang, Roger Zimmermann

TL;DR

Uni3D-MoE tackles scalable, adaptive 3D multimodal understanding by integrating five modalities (RGB, RGBD, BEV, point clouds, and voxels) through modality-specific encoders and adapters, with the resulting tokens fused in a sparse MoE-based LLM that routes at the token level. The model introduces a two-stage training regimen and a learnable soft router that assigns each modality token to a subset of experts, guided by a sparsity-aware loss that balances expert utilization. Empirical results on ScanNet-derived benchmarks show gains across 3D question answering, visual grounding, and dense captioning, with the MoE design providing additional improvements over strong baselines. The results suggest that modality-aware, expert-driven fusion yields more complete scene representations and better task-specific reasoning, with practical implications for embodied AI and robotics.

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a learnable routing mechanism within the sparse MoE-based large language model, dynamically selecting appropriate experts at the token level. Each expert specializes in processing multimodal tokens based on learned modality preferences, thus facilitating flexible collaboration tailored to diverse task-specific requirements. Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the efficacy of Uni3D-MoE.
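As a rough illustration of the token-level routing described in the abstract, the sketch below sends each token to a small subset of experts and mixes their outputs by gate weight. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear router, the top-k selection rule, the gate renormalization, and the names `route_token`, `moe_layer`, `router_weights`, and `experts` are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, router_weights, top_k=2):
    """Pick top_k experts for one token; return [(expert_index, gate), ...].

    router_weights holds one weight vector per expert (an assumed linear
    router; the paper's router may differ).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected gates sum to 1.
    return [(i, probs[i] / norm) for i in top]

def moe_layer(tokens, router_weights, experts, top_k=2):
    """Mix the outputs of each token's selected experts by gate weight."""
    outputs = []
    for token in tokens:
        mixed = [0.0] * len(token)
        for idx, gate in route_token(token, router_weights, top_k):
            expert_out = experts[idx](token)
            mixed = [m + gate * y for m, y in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs

# Toy setup: 4 experts that just scale their input (illustrative only).
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[2.0, 0.0], [0.0, 2.0]]

for token in tokens:
    print(token, "->", route_token(token, router_weights))
outputs = moe_layer(tokens, router_weights, experts)
```

In this toy setup the two tokens are routed to different expert pairs, which is the behavior the abstract attributes to learned modality preferences. In a real model the router weights would be trained jointly with the experts.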

Paper Structure

This paper contains 27 sections, 5 equations, 14 figures, 13 tables, 1 algorithm.

Figures (14)

  • Figure 1: Challenges in 3D scene understanding. (1) Limited modalities may not provide enough scene information. (2) Different question types have varying dependencies on modalities. Existing methods typically treat all modality tokens equally, without adapting to question-specific modality preferences.
  • Figure 2: Overview of our method. Uni3D-MoE covers major input modalities of 3D scenes, including RGB/depth images, BEV maps, point clouds, and voxels. To ensure informative spatial coverage, multi-view images are selected using a Maximum Voxel Coverage Sampling (MVCS) algorithm. Each modality is encoded by a modality-specific encoder and aligned via lightweight adapters. The resulting 3D visual tokens, together with text tokens, are then fed into a sparse Mixture-of-Experts (MoE)-based LLM. A learnable soft router dynamically assigns each token to a subset of suitable experts for specialized processing. The model is optimized with a joint objective combining cross-entropy loss $l_{ce}$ and a sparsity-aware expert balancing loss $l_{moe}$.
  • Figure 3: Overview of two-stage training strategy.
  • Figure 4: Token-to-expert routing across MoE layers. The first two rows show the modality token distribution across different experts at various MoE layers. The higher proportion of RGB, RGBD, and BEV tokens is attributed to their larger token counts, while expert-wise distributions reveal each expert's modality preferences. The third row presents the expert assignment distribution for each modality, indicating how each modality tends to select different experts throughout the MoE layers.
  • Figure 5: Modality-expert routing preferences across different question types. Line thickness indicates normalized token routing proportions. Preferred modality-expert routes for each query type are highlighted in color; others are shown in gray.
  • ...and 9 more figures
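
The Figure 2 caption names a Maximum Voxel Coverage Sampling (MVCS) step for choosing multi-view images but does not spell out the procedure here. One plausible reading is a greedy maximum-coverage selection, sketched below: at each step, pick the view that adds the most voxels not yet covered. This is an assumption for illustration rather than the paper's algorithm; the function `select_views`, the `view_voxels` mapping, and the toy voxel ids are hypothetical.

```python
def select_views(view_voxels, num_views):
    """Greedily pick up to num_views views that maximize voxel coverage.

    view_voxels maps a view id to the set of voxel ids visible from it
    (assumed to be precomputed from camera poses and depth).
    Returns (selected view ids in pick order, set of covered voxel ids).
    """
    covered = set()
    selected = []
    remaining = dict(view_voxels)
    while remaining and len(selected) < num_views:
        # View contributing the most voxels that are not covered yet.
        best = max(remaining, key=lambda v: len(remaining[v] - covered))
        if not remaining[best] - covered:
            break  # No remaining view adds new voxels.
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Toy scene: four candidate views over eight voxels.
view_voxels = {
    "v0": {1, 2, 3},
    "v1": {3, 4},
    "v2": {1, 2},
    "v3": {5, 6, 7, 8},
}
selected, covered = select_views(view_voxels, num_views=2)
print(selected, sorted(covered))
```

With a budget of two views, the sketch picks the view with four voxels first and then the view that adds three new ones. The redundant view `v2` is never chosen because it adds nothing once `v0` is selected.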