DoReMi: A Domain-Representation Mixture Framework for Generalizable 3D Understanding
Mingwei Xing, Xinliang Wang, Yifeng Shi
TL;DR
DoReMi tackles cross-domain generalization in 3D point-cloud understanding by integrating domain-aware and unified representations within a Mixture-of-Experts framework. It pairs a domain-aware branch (Do), which uses Domain-Guided Spatial Routing (DSR) and Entropy-Controlled Dynamic Allocation (EDA), with a frozen unified representation branch (Re) pretrained through robust multi-attribute self-supervised learning. The Re branch preserves cross-domain geometric priors, while the Do branch learns domain-specific adaptations through adaptive expert routing guided by spatial context and uncertainty. Empirical results across ScanNet, S3DIS, nuScenes, Waymo, Matterport3D, and ARKitScenes demonstrate state-of-the-art or competitive performance for indoor/outdoor segmentation and multimodal detection, highlighting strong cross-domain generalization and efficiency advantages. DoReMi thus offers a scalable, generalizable foundation for future 3D understanding research and applications.
Abstract
The generalization of 3D deep learning across multiple domains remains constrained by the limited scale of existing datasets and the high heterogeneity of multi-source point clouds. Point clouds collected from different sensors (e.g., LiDAR scans and mesh-derived point clouds) exhibit substantial discrepancies in density and noise distribution, resulting in negative transfer during multi-domain fusion. Most existing approaches focus exclusively on either domain-aware or domain-general features, overlooking the potential synergy between them. To address this, we propose DoReMi (Domain-Representation Mixture), a Mixture-of-Experts (MoE) framework that jointly models a Domain-aware Experts branch and a unified Representation branch to enable cooperative learning between specialized and generalizable knowledge. DoReMi dynamically activates the domain-aware expert branch via Domain-Guided Spatial Routing (DSR) for context-aware expert selection and employs Entropy-Controlled Dynamic Allocation (EDA) for stable and efficient expert utilization, thereby adaptively modeling diverse domain distributions. Complemented by a frozen unified representation branch pretrained through robust multi-attribute self-supervised learning, DoReMi preserves cross-domain geometric and structural priors while maintaining global consistency. We evaluate DoReMi across multiple 3D understanding benchmarks. Notably, DoReMi achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches and showing strong potential as a foundation framework for future 3D understanding research. The code will be released soon.
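To make the idea of entropy-guided expert allocation concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how EDA or DSR are computed, so everything here is an assumption: the function names (`entropy_controlled_routing`, `mix`), the rule that maps normalized routing entropy to the number of active experts, the `k_min`/`k_max` parameters, and the residual combination of frozen and expert features are all hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def entropy_controlled_routing(logits, k_min=1, k_max=2):
    """Hypothetical rule: points with uncertain routing activate more experts.

    logits: (N, E) router scores for N points over E experts.
    Returns per-point expert weights (N, E) and active-expert counts (N,).
    """
    probs = softmax(logits)
    num_experts = probs.shape[-1]
    # Shannon entropy of the routing distribution, normalized to [0, 1].
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    norm_entropy = entropy / np.log(num_experts)
    # Assumed mapping: linear interpolation between k_min and k_max.
    k = k_min + np.rint(norm_entropy * (k_max - k_min)).astype(int)
    k = np.clip(k, k_min, k_max)
    # Rank experts by probability and keep the top-k per point.
    order = np.argsort(-probs, axis=-1)
    ranks = np.argsort(order, axis=-1)
    mask = ranks < k[:, None]
    weights = probs * mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, k

def mix(frozen_feats, expert_feats, weights):
    """Assumed fusion: frozen features plus the weighted expert outputs.

    frozen_feats: (N, C), expert_feats: (N, E, C), weights: (N, E).
    """
    return frozen_feats + np.einsum("ne,nec->nc", weights, expert_feats)

# A confident point (row 0) and a maximally uncertain point (row 1).
logits = np.array([[10.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
weights, k = entropy_controlled_routing(logits)
fused = mix(np.zeros((2, 3)), np.ones((2, 4, 3)), weights)
```

In this sketch the confident point is routed to a single expert, while the uncertain point spreads its weight over two. The actual DSR and EDA formulations may differ substantially from this rule.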
