Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

David Shavin, Sagie Benaim

TL;DR

This work tackles the lack of 3D awareness in Vision Foundation Models (VFMs) by introducing Splat and Distill (SnD), a framework that augments the teacher with a fast, feed-forward 3D reconstruction pipeline. The teacher lifts 2D context-view features into a 3D Gaussian scene and splats them to a novel viewpoint, which provides a geometrically grounded supervisory signal. This signal is distilled into a student via a DINO-style objective with EMA teacher updates. Key contributions include mask-aware feature lifting, semantic blending for regularization, and a distillation objective that improves monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation across multiple datasets and backbone sizes. The approach enhances both the geometric understanding and the semantic richness of 2D VFM features.

Abstract

Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing the slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior work, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. The project page is available at https://davidshavin4.github.io/Splat-and-Distill/
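
The student-teacher loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the general shape of a DINO-style cross-entropy between softened teacher and student outputs, plus an EMA update of the teacher from the student. The function names, the temperatures (`t_student`, `t_teacher`), and the `momentum` value are assumptions chosen for illustration, and plain Python lists stand in for feature vectors and weights.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def distill_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions.

    The temperatures are illustrative assumptions, not values from the paper.
    In SnD the teacher target would be the splatted (and blended) feature
    map for the target view; here it is just a list of numbers.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    student_log_probs = log_softmax(student_logits, t_student)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move each teacher weight toward the student weight (EMA).

    The momentum value is an assumed placeholder.
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


if __name__ == "__main__":
    target = [2.0, 0.0, 0.0]
    print(distill_loss([2.0, 0.0, 0.0], target))  # student agrees: small loss
    print(distill_loss([0.0, 2.0, 0.0], target))  # student disagrees: large loss
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

Because the teacher is an EMA of the student, any gain in the student's multi-view consistency feeds back into the features that are lifted and splatted at the next step, which is the dynamic the abstract refers to.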

Paper Structure (28 sections, 5 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Splat and Distill (SnD) is a student-teacher distillation framework that augments the teacher with a feed-forward 3D reconstruction pipeline during training, resulting in 3D-aware 2D features. Left: Applying our approach to DINOv2 yields 2D features that enable state-of-the-art performance on downstream tasks such as monocular depth estimation (Depth), surface normal estimation (sNorm), semantic segmentation (Seg), and multi-view correspondence (Corres). Shown here is a comparison of SnD (our method) to vanilla DINOv2 and to state-of-the-art approaches for improving 3D awareness, Fit3D [yue2024improving] and MEF [you2024multiview], based on a DINOv2 ViT-Small model and evaluated on the NYUv2 [silberman2012indoor], ScanNet [dai2017scannet], and ScanNet++ [yeshwanth2023scannet++] datasets. For visualization, we provide normalized scores (min–max per metric, weakest baseline set to 0), using inverse RMSE for depth and normal estimation, IoU for segmentation, and Recall for correspondence (higher is better). See Sec. \ref{sec:experiments} for further results and details. Right: Visualization of our method compared to DINOv2.
  • Figure 2: Method Overview. Starting from the left-hand side (LHS), two context views $\mathbf{I}_j^{\text{ctx}}$ are passed through a teacher network, producing two low-resolution 2D feature maps $\mathbf{F}_j^{\text{ctx}}$. Using the corresponding semantic masks, mask-aware upscaling (Sec. \ref{sec:mask_aware_upscaling}) produces 2D features $\mathbf{F}_j^{\text{high}}$ at the input resolution. In parallel, a pretrained feed-forward 3D reconstruction model predicts 3D Gaussian primitives $\textcolor{Cerulean}{\{\mu_j,\Sigma_j,\alpha_j\}}$ from the same context views $\mathbf{I}_j^{\text{ctx}}$ (Sec. \ref{sec:ff}). The upscaled 2D feature maps $\mathbf{F}_j^{\text{high}}$ are then lifted to these 3D Gaussian primitives using 2D-3D correspondences, yielding a feature-augmented GS scene $\mathcal{G}_j \leftarrow \textcolor{Cerulean}{\{\mu_j,\Sigma_j,\alpha_j\}}\cup\textcolor{Mulberry}{\{\mathbf{f}_j\}}$ (Sec. \ref{sec:mask_aware_upscaling}). Next, the scene is splatted to a target viewpoint, producing a 2D feature map, which is then blended with the semantic mask of the target view, resulting in 2D features $\mathbf{F}_{\text{blend}}^{\text{tgt}}$ (Sec. \ref{sec:blending}). Concurrently, as shown on the right-hand side (RHS), the target image $\mathbf{I}^{\text{tgt}}$ (corresponding to the rendered viewpoint) is passed through the student network to obtain its feature map $\mathbf{F}_{\text{s}}^{\text{tgt}}$. $\mathbf{F}_{\text{blend}}^{\text{tgt}}$ is then bilinearly downscaled to a lower-resolution 2D feature map, which is compared to $\mathbf{F}_{\text{s}}^{\text{tgt}}$ to supervise the student via a distillation loss (Sec. \ref{sec:losses}). The teacher's weights are updated as an EMA of the student's weights. Note that SnD is fine-tuned on ScanNet++.
  • Figure 3: Qualitative comparison of monocular depth estimation using a ViT-Small backbone (GT = Ground Truth).
  • Figure 4: Qualitative comparison of surface normal estimation using a ViT-Small backbone.
  • Figure 5: Qualitative comparison of multi-view correspondences using a ViT-Small backbone. Lines connect matched points between the two views; color encodes the 2D Euclidean reprojection error computed under the ground-truth pose, with green indicating small error and red indicating large error.
  • ...and 9 more figures
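
The score normalization mentioned in the Figure 1 caption (min–max per metric, weakest baseline set to 0, inverse RMSE for error metrics) can be sketched as follows. This is one plausible reading of the caption rather than the authors' plotting code. The function names are hypothetical, and the method names and numbers in the usage example are placeholders, not results from the paper.

```python
def to_higher_is_better(value, lower_is_better):
    """Invert error metrics such as RMSE so that larger always means better."""
    return 1.0 / value if lower_is_better else value


def normalize_metric(scores, lower_is_better=False):
    """Min-max normalize one metric across methods.

    `scores` maps a method name to its raw value for a single metric.
    The weakest method maps to 0 and the strongest to 1.
    """
    oriented = {
        name: to_higher_is_better(value, lower_is_better)
        for name, value in scores.items()
    }
    lo, hi = min(oriented.values()), max(oriented.values())
    if hi == lo:
        # All methods tie; there is no spread to normalize.
        return {name: 0.0 for name in oriented}
    return {name: (v - lo) / (hi - lo) for name, v in oriented.items()}


if __name__ == "__main__":
    # Placeholder RMSE values for three hypothetical methods (lower is better).
    rmse = {"method_a": 0.5, "method_b": 0.4, "method_c": 0.25}
    print(normalize_metric(rmse, lower_is_better=True))

    # Placeholder recall values (higher is better).
    recall = {"method_a": 30.0, "method_b": 45.0, "method_c": 40.0}
    print(normalize_metric(recall))
```

Normalizing each metric separately puts error-based and accuracy-based scores on a shared 0 to 1 axis, at the cost of hiding the absolute gaps between methods.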