Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
TL;DR
Concerto is a joint 2D-3D self-supervised framework that combines intra-modal 3D self-distillation with cross-modal embedding prediction to learn coherent spatial representations. It aligns 3D point features with 2D image patches via camera geometry and a cosine-based objective. In linear probing for 3D scene perception, the resulting representations outperform standalone 2D and 3D self-supervised baselines (by 14.2% and 4.8%, respectively) as well as their feature concatenation, and full fine-tuning sets new state-of-the-art results on ScanNet (80.7% mIoU) and related benchmarks. The approach extends to video-lifted point clouds and includes a linear projection into CLIP's language space to assess open-world grounding. Ablations highlight the importance of cross-modal synergy, data and model scaling, and efficient fine-tuning via LoRA, pointing to potential for scalable, multi-domain spatial understanding. Overall, Concerto shows that multi-modal self-supervision can produce richer, more predictive spatial representations than single-modality learning alone.
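The cross-modal alignment described above can be illustrated with a minimal sketch. It assumes a pinhole camera, points already expressed in the camera frame, and a loss of the form 1 minus cosine similarity; the function names (`project_point`, `cross_modal_loss`), the patch-grid layout, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
import math

def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates.

    Returns None for points at or behind the camera plane.
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + eps)

def cross_modal_loss(points, point_feats, patch_feats, intrinsics, image_hw, patch_size):
    """Mean (1 - cosine) between each visible point's feature and the
    feature of the image patch it projects into (assumed loss form)."""
    fx, fy, cx, cy = intrinsics
    height, width = image_hw
    losses = []
    for point, feat in zip(points, point_feats):
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            continue  # behind the camera: no 2D correspondence
        u, v = uv
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        row, col = int(v // patch_size), int(u // patch_size)
        losses.append(1.0 - cosine_similarity(feat, patch_feats[row][col]))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, points without a valid projection simply contribute nothing to the loss; how a real pipeline handles occlusion and multi-view correspondences is not specified here.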
Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
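The translator mentioned above linearly projects Concerto representations into CLIP's language space. A minimal sketch of how such a projection could be used for open-vocabulary labeling follows. The weight matrix, bias, and text embeddings are toy placeholders, and `linear_project` and `zero_shot_label` are hypothetical helper names; how the projection is trained is not covered here.

```python
import math

def linear_project(feature, weight, bias):
    """Apply an affine map: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weight, bias)]

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + eps)

def zero_shot_label(feature, weight, bias, text_embeddings):
    """Project a point feature into the language space and return the
    name of the most similar text embedding (toy stand-ins for CLIP)."""
    projected = linear_project(feature, weight, bias)
    return max(text_embeddings, key=lambda name: cosine(projected, text_embeddings[name]))

# Toy usage: a 3-D point feature mapped into a 2-D "language" space.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
label = zero_shot_label([0.9, 0.1, 0.5], weight, bias, text_embeddings)
```

Because the map is linear, any category that can be named in the language space can be compared against a point feature at inference time without retraining the 3D encoder.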
