UDON: Universal Dynamic Online distillatioN for generic image representations

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum

TL;DR

UDON addresses the challenge of learning a scalable universal image embedding capable of fine-grained, instance-level recognition across diverse domains. It introduces a multi-teacher online distillation framework with a shared backbone, where per-domain teachers guide a universal student via classification and distillation losses, and a dynamic sampler prioritizes slower-learning domains. The approach achieves state-of-the-art results on the UnED benchmark, especially on challenging long-tail domains, and significantly reduces compute cost by sharing parameters across teachers and the student. These advancements enable robust, scalable universal representations that bridge the gap between single-domain specialization and full-scale specialist ensembles.

Abstract

Universal image representations are critical in enabling real-world fine-grained and instance-level recognition applications, where objects and entities from any domain must be identified at large scale. Despite recent advances, existing methods fail to capture important domain-specific knowledge, while also ignoring differences in data distribution across different domains. This leads to a large performance gap between efficient universal solutions and expensive approaches utilising a collection of specialist models, one for each domain. In this work, we make significant strides towards closing this gap by introducing a new learning technique, dubbed UDON (Universal Dynamic Online DistillatioN). UDON employs multi-teacher distillation, where each teacher is specialized in one domain, to transfer detailed domain-specific knowledge into the student universal embedding. UDON's distillation approach is not only effective but also very efficient: most model parameters are shared between the student and all teachers, and all models are jointly trained in an online manner. UDON also comprises a sampling technique which adapts the training process by dynamically allocating batches to domains which are learned more slowly and require more frequent processing. This significantly boosts the learning of complex domains which are characterised by a large number of classes and long-tail distributions. With comprehensive experiments, we validate each component of UDON, and showcase significant improvements over the state of the art on the recent UnED benchmark. Code: https://github.com/nikosips/UDON.
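The dynamic sampling idea in the abstract (allocating more batches to domains that are learned more slowly) can be illustrated with a minimal sketch. The abstract does not state the actual sampling rule, so the rule below is an assumption made purely for illustration: each domain's sampling probability is a softmax over a running training loss, so domains with a higher loss are drawn more often. The names `domain_sampling_probs`, `sample_domain`, and `temperature` are hypothetical and are not taken from the UDON code base.

```python
import numpy as np


def domain_sampling_probs(running_losses, temperature=1.0):
    """Turn per-domain running losses into sampling probabilities.

    ASSUMPTION: a softmax over losses stands in for "how slowly a domain
    is learned"; the rule used in the paper may differ.
    """
    losses = np.asarray(running_losses, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    scaled = (losses - losses.max()) / temperature
    weights = np.exp(scaled)
    return weights / weights.sum()


def sample_domain(running_losses, rng, temperature=1.0):
    """Draw the index of the domain that supplies the next batch."""
    probs = domain_sampling_probs(running_losses, temperature)
    return int(rng.choice(len(probs), p=probs))


# Toy example with three domains; the second has the highest running loss.
rng = np.random.default_rng(0)
losses = [0.5, 2.0, 1.0]
probs = domain_sampling_probs(losses)
next_domain = sample_domain(losses, rng)
```

In this sketch a larger `temperature` flattens the distribution towards uniform sampling, while a smaller one concentrates batches on the highest-loss domain.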

Paper Structure (32 sections, 6 equations, 4 figures, 12 tables)

Figures (4)

  • Figure 1: Training of a universal embedding on multiple fine-grained visual domains. The baseline approach of ycc+23 (left) uses a classification loss across training classes from all domains. It is prone to cancelling out contradicting cues from different domains. To overcome this issue, a naive multi-teacher distillation approach (middle) first trains one specialized teacher per domain (with a classification loss) to capture domain specifics, then distills them into the universal embedding (student). Our proposed Universal Dynamic Online distillatioN, UDON (right), jointly trains the specialized teacher embeddings and the universal embedding (student) with classification losses and, at the same time, distills the teacher embeddings into the universal embedding. Due to joint training of a shared backbone, UDON scales to a large number of domains.
  • Figure 2: Block diagram of UDON's training process. Each batch of size $B$ contains images from a single domain (e.g., cars, natural world). When a batch from domain $i$ is processed, the $i$-th teacher head is used. Both the teacher and the student apply a classification loss ($\mathcal{L}_{\text{cls}}^{t_i}$, $\mathcal{L}_{\text{cls}}^{u}$) to their batched logits ($L_{t_i}$, $L_u$), predicting among $C_i$ classes. The student is additionally trained via distillation, learning intra-batch relationships ($\mathcal{L}_{\text{rel}}^{t_i}$) and logits ($\mathcal{L}_{\text{log}}^{t_i}$) under the guidance of the domain teacher. Note that the distillation losses are backpropagated only through the student's head.
  • Figure 3: Qualitative results for our UDON method. We present the 5 nearest neighbors retrieved by the baseline (USCRR) embedding (top row) and by the proposed UDON embedding (bottom row), for queries on which the proposed method improves over the baseline. The domain each image comes from is shown underneath it. Correct neighbors have a green border, incorrect ones a red border.
  • Figure 4: Additional qualitative results for our UDON method. We present the nearest neighbors retrieved by the baseline (USCRR) embedding (top row) and by the proposed UDON embedding (bottom row), for queries on which the proposed method improves over the baseline. The domain each image comes from is shown underneath it. Correct neighbors have a green border, incorrect ones a red border. For queries whose class has fewer than 5 positives in the index, we present as many neighbors as there are positives, since only those are taken into account when calculating the metrics.
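The four loss terms named in the Figure 2 caption can be sketched for a single-domain batch as follows. This is a forward-pass illustration only, under stated assumptions: the caption names the losses but does not give their formulas, so here the relational term is taken to be the mean squared difference between the student's and teacher's intra-batch cosine-similarity matrices, and the logit term is taken to be the KL divergence from teacher to student class distributions. The function names and the dictionary keys are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy over the batch for integer class labels."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())


def cosine_similarity_matrix(emb):
    """B x B matrix of cosine similarities between the batch embeddings."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T


def udon_style_losses(emb_t, emb_u, logits_t, logits_u, labels):
    """Loss terms for one batch of domain i (teacher t_i, student u).

    Forward pass only. The caption notes that the distillation losses are
    backpropagated only through the student's head; NumPy has no gradients,
    so that aspect is not modelled here.
    """
    cls_t = cross_entropy(logits_t, labels)  # L_cls^{t_i}
    cls_u = cross_entropy(logits_u, labels)  # L_cls^{u}
    # ASSUMPTION: relational distillation as MSE between similarity matrices.
    diff = cosine_similarity_matrix(emb_u) - cosine_similarity_matrix(emb_t)
    rel = float(np.mean(diff ** 2))  # L_rel^{t_i}
    # ASSUMPTION: logit distillation as KL(teacher || student).
    logp_t, logp_u = log_softmax(logits_t), log_softmax(logits_u)
    kl = np.sum(np.exp(logp_t) * (logp_t - logp_u), axis=1)
    log = float(np.mean(kl))  # L_log^{t_i}
    return {"cls_t": cls_t, "cls_u": cls_u, "rel": rel, "log": log}


# Toy batch: B = 4 images, C_i = 3 classes; the embedding sizes may differ
# because only the B x B similarity matrices are compared.
rng = np.random.default_rng(0)
emb_t, emb_u = rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
logits_t, logits_u = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 2])

losses = udon_style_losses(emb_t, emb_u, logits_t, logits_u, labels)
# A student identical to its teacher incurs zero distillation loss.
same = udon_style_losses(emb_t, emb_t, logits_t, logits_t, labels)
```

How the four terms are weighted and summed into a single objective is not stated in the caption, so the sketch returns them separately rather than guessing a combination.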