UNIC: Universal Classification Models via Multi-teacher Distillation

Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, Yannis Kalantidis

TL;DR

UNIC shows that a single ViT encoder can generalize across ImageNet, transfer, and patch-based tasks by distilling from multiple complementary teachers. The approach hinges on a ladder of expendable projectors and a teacher dropping regularization scheme to balance diverse teacher signals, producing encoders that match or exceed the best teacher across tasks. Empirical results on image- and patch-level tasks, including dense predictions like segmentation and depth, demonstrate strong generalization and efficient weight/feature-space utilization. This work advances universal representation learning by enabling task-agnostic, plug-and-play classification encoders without task-specific adapters, with broader implications for robust, general-purpose visual representations.

Abstract

Pretrained models have become a commodity and offer strong results on a broad range of tasks. In this work, we focus on classification and seek to learn a unique encoder able to take from several complementary pretrained models. We aim at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyse standard distillation when driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually propose improvements to the basic distillation setup. Among those, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers' influence. Our final distillation strategy leads to student models of the same capacity as any of the teachers, while retaining or improving upon the performance of the best teacher for each task. Project page and code: https://europe.naverlabs.com/unic
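
The teacher dropping idea in the abstract can be illustrated with a small sketch. The abstract only says that it balances the teachers' influence, so the exact rule here is an assumption: the teacher with the largest current loss is always kept, and every other teacher is dropped at random. The function names, the `drop_prob` parameter, and the loss values are hypothetical and serve only as illustration.

```python
import random


def teacher_drop_coefficients(losses, drop_prob=0.5, rng=random):
    """Return one 0/1 coefficient per teacher.

    Assumed rule (not confirmed by the text above): the teacher with the
    largest loss is always kept; each other teacher is dropped independently
    with probability `drop_prob`.
    """
    if not losses:
        raise ValueError("need at least one teacher loss")
    hardest = max(losses, key=losses.get)  # teacher the student fits worst
    coeffs = {}
    for name in losses:
        if name == hardest:
            coeffs[name] = 1.0
        else:
            coeffs[name] = 0.0 if rng.random() < drop_prob else 1.0
    return coeffs


def combined_loss(losses, coeffs):
    """Sum of the per-teacher losses, weighted by their coefficients."""
    return sum(coeffs[name] * value for name, value in losses.items())


# Hypothetical per-teacher distillation losses for one training step.
losses = {"DINO": 0.82, "DeiT-III": 1.37}
coeffs = teacher_drop_coefficients(losses, drop_prob=0.5, rng=random.Random(0))
total = combined_loss(losses, coeffs)
```

Under this assumed rule the student always receives a gradient from the teacher it currently matches worst, which is one way to keep a single easy-to-fit teacher from dominating training.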

Paper Structure (61 sections, 5 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 2: Relative gains using our UNIC encoder distilled from four teachers (DINO, DeiT-III, iBOT, dBOT-ft), over the respective best teacher for each task. UNIC solves all classification tasks using a single encoder and no task-specific parameters.
  • Figure 3: Overview of our multi-teacher distillation setup. The same input image is fed to each teacher and to the student. We employ feature standardization at the output of all teachers (\ref{sec:tokens}), a ladder of expendable projectors attached to the student (\ref{sec:projectors}), and teacher dropping regularization to balance teachers (\ref{sec:balancing}). The latter enables us to adaptively select a subset of teachers to contribute to the loss simply using loss magnitudes. We use dedicated projectors for the CLS and patch tokens (\ref{sec:tokens}).
  • Figure 4: Analyzing teacher dropping regularization (tdrop). (a) Loss for each of the two teachers during multi-teacher distillation, with and without tdrop. (b) ImageNet-1K top-1 accuracy when distilling from DINO & DeiT-III together, versus distilling only from DeiT-III, i.e. the teacher that excels at this task.
  • Figure 5: Teacher coefficients $\alpha_t$ during distillation from DeiT and DINO.
  • Figure 6: Performance of different UNIC encoders on different pairs of tasks. We report performance for UNIC encoders distilled from DINO & DeiT-III, from iBOT & dBOT-ft, and from all four teachers together. We show results on ImageNet-1K (a), over 15 transfer learning tasks (a, b), semantic segmentation (b, c) and depth estimation (c).
  • ...and 2 more figures
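
The feature standardization mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming that standardization means shifting each feature dimension of a teacher's output to zero mean and scaling it to unit variance; computing the statistics over the given batch, the `eps` constant, and the function name are assumptions, not details taken from the captions.

```python
from statistics import fmean, pstdev


def standardize_features(features, eps=1e-6):
    """Standardize a batch of feature vectors per dimension.

    `features` is a list of N vectors with D floats each. Statistics are
    computed over this batch only, which is an assumption of this sketch.
    """
    columns = list(zip(*features))             # D columns of N values
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]    # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
        for row in features
    ]


# Two hypothetical teachers whose raw outputs live on very different scales.
teacher_a = [[0.1, 0.2], [0.3, 0.6]]
teacher_b = [[100.0, -50.0], [300.0, 150.0]]

standardized_a = standardize_features(teacher_a)
standardized_b = standardize_features(teacher_b)
```

After this step both teachers produce targets on a comparable scale, so the per-teacher loss magnitudes used for teacher dropping are less affected by how large each teacher's raw features happen to be.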