DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, Yannis Kalantidis
TL;DR
The paper addresses the challenge of distilling heterogeneous teachers (models trained for different tasks and on different data domains) into a single universal encoder. It proposes DUNE, a ViT-Base encoder distilled from DINOv2, MASt3R, and Multi-HMR, using task-specific Transformer projectors and data-sharing strategies to harmonize cross-domain representations. Across 2D vision, 3D scene understanding, and 3D human perception tasks, DUNE matches or surpasses its larger teacher encoders and sets a new state of the art on Map-free Visual Relocalization while maintaining strong generalization. The work shows that careful data curation, projector design, and selective fine-tuning enable efficient multi-task inference with a single backbone, a practical benefit for cross-domain perception systems.
Abstract
Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
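To make the idea of teacher-specific encoding concrete, the following is a minimal sketch of a multi-teacher distillation objective, not the authors' implementation. It assumes a single feature vector per image, one linear projector per teacher (the paper uses Transformer projectors), and a cosine-distance loss; the function names, the loss choice, and the dictionary keys are all illustrative assumptions.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, 1e-12)  # guard against zero-length vectors


def distillation_loss(student_feat, teacher_feats, projectors):
    """Sum per-teacher losses; each teacher gets its own projector.

    Assumption: the shared student feature is mapped into each teacher's
    feature space by a separate head, then compared with that teacher's
    output. The real method may weight or combine losses differently.
    """
    total = 0.0
    for name, target in teacher_feats.items():
        prediction = linear(projectors[name], student_feat)
        total += cosine_distance(prediction, target)
    return total


if __name__ == "__main__":
    random.seed(0)
    student_dim = 4
    # Hypothetical teacher feature sizes; the keys are labels only.
    teacher_dims = {"dinov2": 3, "mast3r": 2, "multihmr": 3}

    student_feat = [random.gauss(0, 1) for _ in range(student_dim)]
    teacher_feats = {
        name: [random.gauss(0, 1) for _ in range(dim)]
        for name, dim in teacher_dims.items()
    }
    projectors = {
        name: [[random.gauss(0, 1) for _ in range(student_dim)] for _ in range(dim)]
        for name, dim in teacher_dims.items()
    }
    print(distillation_loss(student_feat, teacher_feats, projectors))
```

Separate projectors let one shared encoder serve teachers whose feature spaces differ in size and meaning, which is the situation the heterogeneous setting creates.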
