Deep Mutual Learning

Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu

TL;DR

The paper introduces Deep Mutual Learning (DML), in which a cohort of networks trains simultaneously and each network learns from its peers through a KL-divergence-based mimicry loss, removing the need for a pre-trained teacher. The method extends to K networks and improves performance on CIFAR-100 and Market-1501; smaller models gain the most, but large models benefit as well. The authors show that DML can outperform traditional distillation and that larger cohorts generalise better, which they attribute in part to higher posterior entropy and convergence to more robust, wide minima. The approach is simple and general, and it is useful both for producing compact, fast models and for improving ensemble performance with minimal overhead.
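
As a rough illustration of the mimicry loss described above, the sketch below computes a per-network DML objective: a supervised cross-entropy term plus the average KL divergence from each peer's predicted distribution to the network's own. It is a minimal NumPy sketch, not the authors' implementation. The function names (`softmax`, `cross_entropy`, `kl_divergence`, `dml_losses`) are hypothetical, and averaging over the batch rather than summing is an assumption made for readability.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class (supervised term).
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def kl_divergence(p_peer, p_self):
    # D_KL(p_peer || p_self), averaged over the batch (assumption: mean, not sum).
    eps = 1e-12
    per_sample = np.sum(p_peer * (np.log(p_peer + eps) - np.log(p_self + eps)), axis=-1)
    return np.mean(per_sample)

def dml_losses(logits_per_net, labels):
    # One loss per network: cross-entropy plus the mean KL from every peer.
    # In a real training loop the peers' probabilities would be treated as
    # constants (no gradient) when updating network i.
    k = len(logits_per_net)
    if k < 2:
        raise ValueError("mutual learning needs at least two networks")
    probs = [softmax(l) for l in logits_per_net]
    losses = []
    for i in range(k):
        mimicry = sum(
            kl_divergence(probs[j], probs[i]) for j in range(k) if j != i
        ) / (k - 1)
        losses.append(cross_entropy(probs[i], labels) + mimicry)
    return losses
```

When all networks produce identical predictions the mimicry term vanishes and each loss reduces to plain cross-entropy; any disagreement between peers adds a positive penalty that pulls their posteriors toward one another.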

Abstract

Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static pre-defined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary -- mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Deep Mutual Learning (DML) schematic. Each network is trained with a supervised learning loss and a KLD-based mimicry loss to match the probability estimates of its peers.
  • Figure 2: Performance (mAP (%)) on Market-1501 with different numbers of networks in cohort
  • Figure 3: Analysis on why DML works
  • Figure 4: Comparison of DML with each individual peer student as teacher and DML with peer student ensemble as teacher (DML_e) with 5 MobileNets trained on Market-1501