Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao

TL;DR

Concerto introduces a joint 2D-3D self-supervised framework that fuses intra-modal 3D self-distillation with cross-modal embedding prediction to learn coherent spatial representations. By aligning 3D point features with 2D image patches via camera geometry and a cosine-based objective, it yields emergent representations that outperform single-modality SSL baselines and naive feature fusion, achieving state-of-the-art results on ScanNet and related benchmarks, both with linear probes and full fine-tuning. The approach extends to video-lifted point clouds and includes a CLIP-based language projection to assess open-world grounding. Complementary ablations highlight the importance of cross-modal synergy, data/model scaling, and efficient fine-tuning via LoRA, pointing to strong potential for scalable, multi-domain spatial understanding. Overall, Concerto demonstrates that multi-modal self-supervision can produce richer, more predictive spatial concepts than single-modality learning alone, with practical implications for open-world perception and scalable 3D scene understanding.
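The cosine-based objective mentioned above can be illustrated with a minimal sketch. It assumes each 3D point already has a predicted embedding and a matched 2D patch embedding of the same dimension; the function names `cosine_similarity` and `cosine_alignment_loss` are hypothetical, and the `1 - cos` form is one common way to write such a loss, not necessarily the exact formulation used in the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def cosine_alignment_loss(point_preds, patch_targets):
    """Mean of (1 - cosine similarity) over matched point/patch pairs.

    Assumption: point_preds[i] is the embedding predicted from point i and
    patch_targets[i] is the image-patch embedding that point projects into.
    """
    if len(point_preds) != len(patch_targets):
        raise ValueError("each point needs exactly one matched patch target")
    if not point_preds:
        return 0.0
    losses = [
        1.0 - cosine_similarity(p, t)
        for p, t in zip(point_preds, patch_targets)
    ]
    return sum(losses) / len(losses)
```

Because the loss depends only on the angle between embeddings, it does not constrain their magnitudes, which is one sense in which a cosine constraint is looser than matching cluster assignments.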

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, and 20 tables.

Figures (6)

  • Figure 1: The "Apple" concept in cognition.
  • Figure 2: Overview of the Concerto architecture. Concerto simulates human multisensory synergy by coupling (a) intra-modal self-distillation on 3D point clouds to progressively refine its internal spatial representations (see Sec. \ref{sec:intra}), and (b) cross-modal joint embedding prediction that aligns point features with corresponding image patch features using camera parameters (see Sec. \ref{sec:cross}). The self-distillation branch (a) employs a restricted online clustering objective, while the joint embedding prediction (b) applies a looser cosine similarity constraint. This dual self-supervised objective encourages the emergence of coherent, modality-agnostic spatial representations.
  • Figure 3: Video spatial perception. Concerto can be directly applied to video-lifted data (top row). The PCA visualizations (bottom two rows) illustrate that Concerto learns finer-grained and more semantically consistent features than DINOv2.
  • Figure 4: Qualitative visualization. Concerto performs well across different point cloud inputs: a complete scene (top two rows) and an incomplete scene (bottom two rows).
  • Figure 5: Video perception. Concerto can be applied to single-view (top row) and multi-view video-lifted data (bottom three rows). We visualize the PCA of one video in RE10K \cite{zhou2018re10k}. In the multi-view setting, the representations from all the frames are computed together for consistency.
  • ...and 1 more figure
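
The Figure 2 caption says point features are aligned with image patch features using camera parameters. A minimal sketch of that correspondence step follows, assuming a pinhole camera with world-to-camera rotation `R`, translation `t`, and intrinsics `fx`, `fy`, `cx`, `cy`. The function name `project_points_to_patches`, the default patch size of 14 pixels, and the row-major patch indexing are all illustrative assumptions, and the sketch ignores occlusion, which a real pipeline would likely need to handle.

```python
def project_points_to_patches(points, R, t, fx, fy, cx, cy,
                              width, height, patch=14):
    """Map each visible 3D point to the index of the image patch it lands in.

    Assumptions (not taken from the paper): pinhole model without lens
    distortion, R and t transform world coordinates into the camera frame,
    and patches are numbered row by row starting from the top-left corner.
    """
    cols = width // patch
    rows = height // patch
    mapping = {}
    for i, (x, y, z) in enumerate(points):
        # World frame -> camera frame.
        xc = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
        yc = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
        zc = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
        if zc <= 0:
            continue  # Point is behind the camera.
        # Perspective projection to pixel coordinates.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        col = int(u // patch)
        row = int(v // patch)
        if 0 <= col < cols and 0 <= row < rows:
            mapping[i] = row * cols + col
    return mapping
```

The returned dictionary pairs point indices with patch indices, so the point feature at index `i` can be compared against the image feature of patch `mapping[i]`. Points that fall outside the image or behind the camera are left out of the mapping.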