DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, Yannis Kalantidis

TL;DR

The paper addresses the challenge of distilling from heterogeneous teachers (models trained for different tasks and on diverse data domains) into a single universal encoder. It proposes DUNE, a ViT-Base encoder distilled from DINO-v2, MASt3R, and Multi-HMR, using task-specific Transformer Projectors and data-sharing strategies to harmonize cross-domain representations. Across 2D vision, 3D scene understanding, and 3D human perception tasks, DUNE matches or surpasses its larger teacher encoders and sets a new state of the art on Map-free Visual Relocalization while maintaining strong generalization. The work demonstrates that carefully designed data curation, projector architecture, and selective fine-tuning enable efficient multi-task inference with a single backbone, a practical benefit for cross-domain perception systems.

Abstract

Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
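
The teacher-specific encoding described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: each projector is a plain linear map rather than a Transformer Projector, the loss is a simple cosine distance, all feature dimensions are small placeholders, and the teacher features are random arrays standing in for real encoder outputs. The function names `cosine_distill_loss` and `multi_teacher_loss` are hypothetical.

```python
import numpy as np

def cosine_distill_loss(pred, target):
    """Mean (1 - cosine similarity) over patches. Assumed loss, for illustration only."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def multi_teacher_loss(shared_feats, projectors, teacher_feats):
    """Project the shared student features once per teacher and average the losses."""
    per_teacher = {
        name: cosine_distill_loss(shared_feats @ projectors[name], target)
        for name, target in teacher_feats.items()
    }
    total = sum(per_teacher.values()) / len(per_teacher)
    return total, per_teacher

rng = np.random.default_rng(0)
num_patches, student_dim = 16, 32          # placeholder sizes, not the real ViT dims
teacher_dims = {"dino_v2": 48, "mast3r": 48, "multi_hmr": 48}  # placeholder dims

# Shared student patch features (stand-in for the single encoder's output).
shared_feats = rng.normal(size=(num_patches, student_dim))
# One linear projector per teacher (stand-in for a teacher-specific projector).
projectors = {n: rng.normal(size=(student_dim, d)) * 0.1 for n, d in teacher_dims.items()}
# Random stand-ins for each teacher encoder's patch features.
teacher_feats = {n: rng.normal(size=(num_patches, d)) for n, d in teacher_dims.items()}

total, per_teacher = multi_teacher_loss(shared_feats, projectors, teacher_feats)
print(f"total={total:.3f}", {n: round(v, 3) for n, v in per_teacher.items()})
```

The point of the sketch is the structure: one shared representation, one projection per teacher, and one loss term per teacher, so that teachers with different feature spaces do not have to agree on a single output space.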

Paper Structure

This paper contains 57 sections, 4 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: DUNE is a universal encoder for 2D and 3D tasks distilled from heterogeneous teachers. It enables multi-task inference with a single encoder. Teachers are DINO-v2 [oquab2024dinov2], MASt3R [mast3r], and Multi-HMR [multihmr] (see Fig. 3 for distillation details).
  • Figure 2: PCA visualization of encoder outputs. Given an image, we extract patch embeddings from the encoders of the teacher models and our student, and reduce their dimension to 3 via PCA.
  • Figure 3: Overview of the DUNE encoder training process. (a) DUNE is trained via distillation from heterogeneous teachers across 2D vision, 3D vision, and 3D human perception, leveraging diverse data from multiple visual domains. We use teacher dropping regularization from [sariyildiz2024unic]. (b) Task-specific heads are then fine-tuned independently for each task, with the DUNE encoder kept frozen.
  • Figure 4: Cumulative explained variance computed over features from three representative datasets, for the three teacher encoders (solid lines) and student's projectors (dashed lines).
  • Figure 5: Correlation of loss updates during training for each pair of teachers when training with different strategies. Training with TP leads to more alignment between teachers regardless of the training data. On the other hand, using all data with all teachers seems to be the best data strategy to improve teacher alignment.
  • ...and 11 more figures