Multi-Level Feature Distillation of Joint Teachers Trained on Distinct Image Datasets

Adrian Iordache, Bogdan Alexe, Radu Tudor Ionescu

TL;DR

A novel teacher-student framework that distills knowledge from multiple teachers trained on distinct datasets can significantly surpass equivalent architectures that are either trained on individual datasets or jointly trained on all datasets at once.

Abstract

We propose a novel teacher-student framework to distill knowledge from multiple teachers trained on distinct datasets. Each teacher is first trained from scratch on its own dataset. Then, the teachers are combined into a joint architecture, which fuses the features of all teachers at multiple representation levels. The joint teacher architecture is fine-tuned on samples from all datasets, thus gathering useful generic information from all data samples. Finally, we employ a multi-level feature distillation procedure to transfer the knowledge to a student model for each of the considered datasets. We conduct image classification experiments on seven benchmarks, and action recognition experiments on three benchmarks. To illustrate the power of our feature distillation procedure, the student architectures are chosen to be identical to those of the individual teachers. To demonstrate the flexibility of our approach, we combine teachers with distinct architectures. We show that our novel Multi-Level Feature Distillation (MLFD) can significantly surpass equivalent architectures that are either trained on individual datasets, or jointly trained on all datasets at once. Furthermore, we confirm that each step of the proposed training procedure is well motivated by a comprehensive ablation study. We publicly release our code at https://github.com/AdrianIordache/MLFD.
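The final stage described above, in which a student is trained to match the joint teacher at several representation levels, can be illustrated with a minimal sketch. The abstract does not state the exact form of the objective, so everything specific here is an assumption made for illustration: the function names (`mse`, `cross_entropy`, `mlfd_style_loss`), the use of mean squared error per level, the averaging over levels, the mixing weight `alpha`, and the premise that student and teacher features already share the same dimensionality at each level. The released code at the repository above is the reference for the actual method.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits: Vector, label: int) -> float:
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mlfd_style_loss(
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    logits: Vector,
    label: int,
    alpha: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Illustrative multi-level objective for one training sample.

    student_feats[j] and teacher_feats[j] are the features at level j.
    Assumption: the feature term is the MSE averaged over levels, and it is
    blended with the classification loss using the weight alpha.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same levels")
    if not student_feats:
        raise ValueError("at least one representation level is required")
    distill = sum(
        mse(s, t) for s, t in zip(student_feats, teacher_feats)
    ) / len(student_feats)
    return (1.0 - alpha) * cross_entropy(logits, label) + alpha * distill


# Example: two levels, two classes; the student matches the teacher at the
# first level only, so the feature term comes entirely from the second level.
loss = mlfd_style_loss(
    student_feats=[[0.1, 0.2], [0.0, 0.0]],
    teacher_feats=[[0.1, 0.2], [2.0, 0.0]],
    logits=[0.0, 0.0],
    label=0,
)
```

In this sketch the joint teacher's features are treated as fixed targets. In practice the student and teacher features would typically differ in size, so some projection would be needed before comparing them; that step is omitted here.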

Paper Structure

This paper contains 29 sections, 3 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Our multi-level feature distillation (MLFD) framework is based on three stages. In the first stage, individual teachers are trained on each dataset. In the second stage, the individual teachers are merged at a certain representation level ($l_1$) into a joint teacher $\textbf{T}_*$, which comprises levels $l_1$, $l_2$, ..., $l_k$. The joint teacher is trained on all datasets $\textbf{D}_1,\textbf{D}_2,...,\textbf{D}_m$, while the individual teachers are kept frozen for efficiency reasons. In the third stage, each student $\textbf{S}_i$ is trained via multi-level feature distillation from the joint teacher $\textbf{T}_*$, for all $i \in \{1,2,...,m\}$. To simplify the visualization, only the first student $\textbf{S}_1$ is illustrated in this figure. Best viewed in color.
  • Figure 2: Top-1 accuracy evolution during the training process for models in $\mathcal{T}_1$. Best viewed in color.
  • Figure 3: Top-1 accuracy during the training process for models in $\mathcal{T}_2$. Best viewed in color.
  • Figure 4: Performance evolution of joint teachers when using different sets of layers (from $\textbf{L}_1$ to $\textbf{L}_4$) to extract features. Best viewed in color.
  • Figure 5: Accuracy rates of the student models on Caltech-101 (left) and Flowers-102 (right) when the number of datasets is increased from one to four. Best viewed in color.
  • ...and 1 more figure