Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald

TL;DR

Chorus tackles the challenge of learning a general-purpose 3D scene encoder directly from 3D Gaussian splats by distilling signals from multiple 2D foundation models. It introduces a shared 3DGS encoder with per-teacher projections and a lift-then-align pipeline that fuses language-aligned, generalist, and object-aware cues into a cohesive 3D embedding. The approach achieves state-of-the-art results across open-vocabulary semantic and instance segmentation, probing, and data-efficient tasks on 3DGS benchmarks, and a point-cloud variant transfers strongly, outperforming the point-cloud baseline while using 39.9 times fewer training scenes. An additional render-and-distill adaptation enables lightweight out-of-domain fine-tuning without heavy 3D pseudo-labeling, and ablations validate the contribution of each teacher and augmentation. Overall, Chorus advances holistic 3D scene understanding by unifying rich semantic priors into a single, efficient 3DGS encoder.

Abstract

While 3D Gaussian Splatting (3DGS) has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3DGS scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision. Beyond 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant that takes only the Gaussians' centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
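
The shared-encoder, per-teacher-projector idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine-distance loss, the equal teacher weights, the teacher keys (`language`, `generalist`, `object_aware`), and all function names are assumptions chosen for illustration. It shows a single shared feature being mapped by one linear projector per teacher and compared against that teacher's target feature.

```python
import math

def project(W, x):
    """Apply a linear projector W (one row per output dimension) to feature x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)

def multi_teacher_loss(shared_feat, projectors, teacher_feats, weights=None):
    """Weighted sum of per-teacher distillation losses (assumed cosine distance)."""
    weights = weights or {name: 1.0 for name in projectors}
    total = 0.0
    for name, W in projectors.items():
        pred = project(W, shared_feat)  # teacher-specific view of the shared embedding
        total += weights[name] * cosine_distance(pred, teacher_feats[name])
    return total

# Hypothetical toy data: one 2-D shared feature, three teachers with their own output sizes.
shared = [1.0, 0.0]
projectors = {
    "language": [[1.0, 0.0], [0.0, 1.0]],
    "generalist": [[0.0, 1.0], [1.0, 0.0]],
    "object_aware": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
}
teacher_feats = {
    "language": [2.0, 0.0],
    "generalist": [1.0, 0.0],
    "object_aware": [-1.0, 0.0, 0.0],
}
loss = multi_teacher_loss(shared, projectors, teacher_feats)
```

Because each teacher has its own projector, the teachers may use different feature dimensions while the encoder output stays shared; only the projectors are teacher-specific in this sketch.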

Paper Structure

This paper contains 12 sections, 8 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Chorus Framework. (a) Multi-Teacher Pretraining. A feed-forward 3DGS scene encoder with per-teacher projectors distills complementary signals (language-aligned, generalist, and object-aware) into a shared embedding. (b) Example Feature PCA (results on novel scenes). At inference we input the full 3DGS scene; PCA on encoder features shows clear semantic awareness despite domain shift. (c) Evaluation & Data Efficiency. Chorus attains strong results across scene understanding tasks while using noticeably fewer training scenes ($8.32\times$ and $39.9\times$ fewer than the SoTA point-cloud pretraining baselines), highlighting the efficiency of our pretraining.
  • Figure 2: Chorus Overview. (a) Multi-Teacher Pretraining. We train a feed-forward 3DGS scene encoder to distill complementary signals from 2D teachers: language-aligned (SigLIP), generalist (DINO), and object-aware (PE). This knowledge is transferred into a shared embedding space via lightweight per-teacher projectors and losses. To accelerate out-of-domain adaptation, we support finetuning the encoder with online rendering-based supervision. (b) Task-Specific Transfer. The pretrained Chorus encoder enables diverse downstream tasks, including semantic and instance segmentation, open-vocabulary query, and 3D visual question answering (VQA).
  • Figure 3: Rendering-Based View Sampling and Pairing. (a) Camera Location Sampling: We use Furthest Point Sampling to select camera positions that achieve broad spatial coverage across the entire navigable scene space. (b) Visibility Culling: For each location, we sample view angles and track the visibility of the 3D Gaussians across frames. (c) View Pairing and Selection: For a given view, we compute the minimum 2D bounding box covering all visible Gaussians. Candidate pose pairs are then computed and sorted by their overlap score. (d,e,f) Rendered images corresponding to the colored camera viewpoints.
  • Figure 4: Inference Feature PCA Visualization. Features from different encoders on a concert hall. Chorus shows the best semantic consistency (see zoomed-in chairs and stairs in the back).
  • Figure 5: 2D Adaptation Ablation. Performance improves with higher teacher render resolution (left) and more adaptation scenes (right). The left x-axis denotes the 2D teacher's feature resolution, formatted as (feature size) $\times$ bilinear upsample factor.
  • ...and 3 more figures
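
The camera location sampling step in Figure 3(a) relies on Furthest Point Sampling. A minimal sketch of the generic greedy algorithm follows; it is not the paper's code, and the candidate positions, the starting index, and the function name are hypothetical. Each iteration picks the candidate whose distance to the nearest already-selected position is largest, which spreads the chosen positions across the space.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new point is farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Hypothetical candidate camera positions (x, y, z) inside a navigable space.
candidates = [
    (0.0, 0.0, 1.5),
    (0.1, 0.0, 1.5),
    (4.0, 0.0, 1.5),
    (4.0, 3.0, 1.5),
    (0.0, 2.5, 1.5),
]
picked = farthest_point_sampling(candidates, 3)
print(picked)  # [0, 3, 2]
```

The near-duplicate position at index 1 is skipped because it adds almost no coverage beyond index 0, which is the behavior that makes this sampler suited to broad spatial coverage.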