Deep Mutual Learning

Ying Zhang; Tao Xiang; Timothy M. Hospedales; Huchuan Lu

TL;DR

The paper introduces Deep Mutual Learning (DML), in which a cohort of networks trains simultaneously and shares knowledge through a KL-divergence-based mimicry loss between peers, removing the need for a pre-trained teacher. The method extends from two networks to K networks and improves performance on CIFAR-100 and Market-1501; smaller models gain the most, but large models benefit as well. The authors show that DML can outperform traditional distillation and that larger cohorts generalise better, partly because mutual learning increases posterior entropy and steers training towards wider, more robust minima. The approach is simple and general, and it is useful both for producing compact, fast models and for improving ensemble performance with little overhead.
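As a rough illustration of the peer mimicry loss summarised above, the following minimal sketch computes a per-network loss for one training example: each network's own cross-entropy plus the KL divergence from its peers' predictions to its own. The summary does not spell out how the K-network case is combined; averaging the KL terms over the K-1 peers is an assumption here, and all function names are hypothetical. Gradient computation and the alternating update schedule are not shown.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Supervised loss for a single example with an integer class label.
    return -math.log(probs[label])

def kl_divergence(p, q):
    # D_KL(p || q) for two discrete distributions; q must be nonzero wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mutual_learning_losses(all_logits, label):
    # all_logits: one list of class logits per network in the cohort.
    # Returns one loss per network: own cross-entropy + mean KL(peer || self).
    # Assumption: the mimicry term is averaged over the K-1 peers.
    if len(all_logits) < 2:
        raise ValueError("mutual learning needs at least two networks")
    probs = [softmax(logits) for logits in all_logits]
    losses = []
    for k, p_k in enumerate(probs):
        peers = [p for j, p in enumerate(probs) if j != k]
        mimicry = sum(kl_divergence(p_l, p_k) for p_l in peers) / len(peers)
        losses.append(cross_entropy(p_k, label) + mimicry)
    return losses

# Two hypothetical networks that disagree on a 3-class example with true class 0.
losses = mutual_learning_losses([[2.0, 0.5, -1.0], [1.0, 1.5, 0.0]], label=0)
```

When all networks produce identical predictions the mimicry term vanishes and each loss reduces to plain cross-entropy; any disagreement between peers adds a strictly positive penalty.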

Abstract

Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static pre-defined teacher and a student, an ensemble of students learns collaboratively, with the students teaching each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on the CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary: mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.

Paper Structure

Table of Contents

Figures (4)