Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation

Xuweiyi Chen, Wentao Zhou, Aruni RoyChowdhury, Zezhou Cheng

TL;DR

This work introduces Point-MoE, a Mixture-of-Experts design that expands model capacity through sparsely activated expert MLPs and a lightweight top-$k$ router, allowing tokens to select specialized experts without requiring dataset supervision.

Abstract

While massively scaling both data and models has become central in NLP and 2D vision, the benefits of such scaling for 3D point cloud understanding remain limited. We study the initial step of scaling 3D point cloud understanding under a realistic regime: large-scale multi-dataset joint training for 3D semantic segmentation, with no dataset labels available at training or inference time. Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard models. Therefore, we introduce Point-MoE, a Mixture-of-Experts design that expands model capacity through sparsely activated expert MLPs and a lightweight top-$k$ router, allowing tokens to select specialized experts without requiring dataset supervision. Trained jointly on a diverse mix of indoor and outdoor datasets, and evaluated on seen datasets as well as in zero-shot settings, Point-MoE outperforms prior methods without using dataset labels for either training or inference. This outlines a scalable path for 3D perception: letting the model discover structure in heterogeneous 3D data rather than imposing it via manual curation or dataset-specific heuristics.
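To make the routing idea in the abstract concrete, here is a minimal sketch of a sparsely activated MoE layer with a top-$k$ router, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class names `ExpertMLP` and `TopKMoE`, the ReLU activation, the random initialization, and the choice to renormalize gate weights with a softmax over only the selected experts are all hypothetical. Note that the router sees only the token feature, never a dataset label.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ExpertMLP:
    """Two-layer MLP expert (ReLU activation is an assumption)."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.w1, x)]
        return matvec(self.w2, h)


class TopKMoE:
    """Hypothetical MoE layer: a linear router picks k experts per token."""

    def __init__(self, dim, hidden, num_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        self.experts = [ExpertMLP(dim, hidden, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return the indices of the k highest-scoring experts and their gates."""
        logits = matvec(self.router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[:self.k]
        # Assumption: gates are a softmax over the selected logits only.
        gates = softmax([logits[i] for i in top])
        return top, gates

    def __call__(self, x):
        """Run only the selected experts and mix their outputs by gate weight."""
        top, gates = self.route(x)
        out = [0.0] * len(x)
        for idx, gate in zip(top, gates):
            y = self.experts[idx](x)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out


moe = TopKMoE(dim=4, hidden=8, num_experts=4, k=2)
token = [0.5, -1.0, 0.25, 2.0]  # made-up token feature
experts, gates = moe.route(token)
output = moe(token)
```

Only `k` of the `num_experts` expert MLPs run for each token, which is how a design of this kind can add parameters without a proportional increase in per-token compute.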

Paper Structure

This paper contains 27 sections, 3 equations, 20 figures, 22 tables.

Figures (20)

  • Figure 1: Overview of multi-dataset training architectures. Point clouds exhibit diverse characteristics across datasets. (a) Naively training Point Transformer V3 (PTv3) [ptv3] on multi-dataset data leads to degraded performance within each domain. (b) Point Prompt Training (PPT) [wu2024ppt] addresses this by adding dataset-aware normalization parameters. (c) Our proposed Point-MoE tackles this challenge with Mixture-of-Experts (MoE), enabling dynamic expert specialization across datasets.
  • Figure 2: t-SNE visualization of feature clustering. The first three columns show that the decoder representations of both PPT and Point-MoE separate datasets better than those of vanilla PTv3. The rightmost column highlights how Point-MoE generalizes to zero-shot datasets, with Matterport3D features aligning closely with those of ScanNet, a semantically similar dataset.
  • Figure 3: Expert choice visualization. Different colors within each image indicate different experts, showing that Point-MoE self-organizes point routing based on spatial and semantic cues across different layers. (a) shows one expert (green) focusing on edges, suggesting spatially aware mid-level routing in the encoder. (b) and (c) show that semantically related regions (e.g., chairs and desks) are consistently assigned to the same expert, even across datasets. (d) shows meaningful routing in an outdoor scene with sparse LiDAR points despite lower point density and more irregular geometry.
  • Figure 4: Expert routing across datasets. Left: the most frequent expert paths through each encoder (E) and decoder (D) layer, with channel sizes in parentheses, showing that expert selection varies across datasets. The effect is especially pronounced in decoder MoE layers. Right: Jensen-Shannon Divergence (JSD) between dataset-specific expert selection distributions at each MoE layer.
  • Figure 6: Expert Specialization Word Cloud. For every class, the frequency with which it selects its top-1 expert is measured on the validation set. Colors indicate the originating dataset of the class (e.g., ScanNet, S3DIS, etc.), providing insight into cross-domain consistency.
  • ...and 15 more figures
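
The Figure 4 caption compares dataset-specific expert selection distributions using the Jensen-Shannon Divergence. The sketch below shows one way such a comparison could be computed for a single MoE layer. It is an assumption-laden illustration: the helper names, the base-2 logarithm (which bounds the result in [0, 1]), and the expert assignments are all made up for demonstration and are not results from the paper.

```python
import math


def selection_distribution(expert_ids, num_experts):
    """Turn a list of per-token expert choices into a probability distribution."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    total = len(expert_ids)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical per-token expert choices at one layer for two datasets.
indoor_choices = [0, 0, 1, 3]
outdoor_choices = [2, 2, 3, 3]

p = selection_distribution(indoor_choices, num_experts=4)
q = selection_distribution(outdoor_choices, num_experts=4)
jsd = jensen_shannon_divergence(p, q)
```

A value near 0 would indicate that the two datasets use the experts at that layer in nearly the same proportions, while a value near 1 would indicate almost disjoint expert usage.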