Multi-View 3D Reconstruction using Knowledge Distillation

Aditya Dutt, Ishikaa Lunawat, Manpreet Kaur

TL;DR

The paper tackles the computational burden of large foundation models like Dust3R for multi-view 3D reconstruction by introducing a knowledge-distillation workflow that trains lightweight CNN and Vision Transformer students to predict world-coordinate 3D points using Dust3R as the teacher. The pipeline computes pairwise image inferences, applies a global alignment, and optimizes with a mean-squared-error objective to mimic Dust3R outputs on the 12Scenes dataset. Comprehensive ablations show that the Vision Transformer student delivers the best combination of accuracy and compactness (5–45 MB) compared to CNN-based students, which underperform on full scene geometry but learn efficiently with pretrained backbones. The work demonstrates that scene-specific, lightweight models can achieve Dust3R-like quality at a fraction of the compute, enabling real-time or edge deployments for downstream tasks such as Visual Localization and SLAM.
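The mean-squared-error objective mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `pointmap_mse` and `fit_offset` are hypothetical, and the toy "student" is just a learnable 3D offset added to fixed predictions, not the CNN or Vision Transformer students from the paper. The teacher points stand in for the globally aligned 3D points that Dust3R would output.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def pointmap_mse(student: List[Point], teacher: List[Point]) -> float:
    """Mean squared error over all coordinates of two equally sized point lists."""
    if len(student) != len(teacher) or not teacher:
        raise ValueError("point lists must be non-empty and the same length")
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((sc - tc) ** 2 for sc, tc in zip(s, t))
    return total / (3 * len(teacher))


def fit_offset(base: List[Point], teacher: List[Point],
               lr: float = 0.5, steps: int = 50) -> List[float]:
    """Toy student (hypothetical): learn one 3D offset by gradient descent on the MSE.

    The student prediction for point i is base[i] + offset.
    """
    n = len(teacher)
    offset = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for k in range(3):
            # d(MSE)/d(offset_k) = 2 / (3n) * sum_i (base_ik + offset_k - teacher_ik)
            grad = 2.0 / (3 * n) * sum(b[k] + offset[k] - t[k]
                                       for b, t in zip(base, teacher))
            offset[k] -= lr * grad
    return offset


# Teacher points play the role of Dust3R outputs (made-up values).
teacher_pts = [(1.0, 2.0, 3.0), (2.0, 3.0, 4.0)]
base_pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]

offset = fit_offset(base_pts, teacher_pts)
student_pts = [tuple(b[k] + offset[k] for k in range(3)) for b in base_pts]
final_loss = pointmap_mse(student_pts, teacher_pts)
```

In this toy setup the teacher points differ from the base predictions by a constant shift, so the learned offset converges to that shift and the loss approaches zero. A real student would replace the single offset with network weights and the hand-written gradient with automatic differentiation.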

Abstract

Large foundation models like Dust3r can produce high-quality outputs such as pointmaps, camera intrinsics, and depth estimates, given stereo image pairs as input. However, applying these outputs to tasks like Visual Localization requires a large amount of inference time and compute resources. To address these limitations, we propose a knowledge distillation pipeline in which Dust3r serves as the teacher, and we explore multiple student architectures trained on the 3D reconstructed points output by Dust3r. Our goal is to build student models that learn scene-specific representations and output 3D points with performance comparable to Dust3r's. We train our models on the 12Scenes dataset. We test two main architectures: a CNN-based architecture and a Vision Transformer-based architecture. For each architecture, we also compare pre-trained models against models built from scratch. We qualitatively compare the 3D points reconstructed by the student model against Dust3r's and discuss the features learned by the student model. We also perform ablation studies on the models through hyperparameter tuning. Overall, we observe that the Vision Transformer presents the best performance visually and quantitatively.

Paper Structure

This paper contains 12 sections and 4 figures.

Figures (4)

  • Figure 1: Reconstructed Kitchen scene with camera poses using DUST3R model and global optimization method
  • Figure 9: ViT output for 2 couch scenes taken from different angles
  • Figure 10: Training loss vs. epochs for different scenes
  • Figure 11: Reconstructed Kitchen scene with camera poses using DUST3R model and global optimization method