Learning from Matured Dumb Teacher for Fine Generalization

HeeSeung Jung, Kangil Kim, Hoyong Kim, Jong-Hun Shin

TL;DR

This paper proposes matured dumb teacher based KD, which conservatively transfers the hypothesis for generalization of the student without massively destroying trained information. The results imply that the proposed method can provide finer generalization than existing methods.

Abstract

The flexibility of neural-network decision boundaries in regions unguided by training data is a well-known problem, typically addressed with generalization methods. A surprising result from recent knowledge distillation (KD) literature is that random, untrained, and equally structured teacher networks can also vastly improve generalization performance. This raises the possibility that undiscovered assumptions exist that are useful for generalization in uncertain regions. In this paper, we shed light on these assumptions by analyzing decision boundaries and confidence distributions of both simple and KD-based generalization methods. Assuming that there exists a decision boundary representing the most general tendency of distinction over the input sample space (i.e., the simplest hypothesis), we show various limitations of these methods when using that hypothesis. To resolve these limitations, we propose matured dumb teacher based KD, which conservatively transfers the hypothesis for generalization of the student without massively destroying trained information. In practical experiments on feed-forward and convolutional neural networks for image classification tasks on the MNIST, CIFAR-10, and CIFAR-100 datasets, the proposed method shows stable improvement in the best test performance in a grid search over hyperparameters. The analysis and results imply that the proposed method can provide finer generalization than existing methods.

Paper Structure

This paper contains 38 sections, 1 theorem, 6 equations, 7 figures, 3 tables.

Key Result

Theorem 1

In any plateau of a good local optimum, $M_{h}$, $\ell_1$ and $\ell_2$ penalization do not guarantee to move toward a worse but simple local optimum $M_{l}$ which indicates the simplest hypothesis. $\nabla_\theta\ell_1=\vec{1}$ and $\nabla_\theta\ell_2=\theta$ for all $M$. $M_l$ varies by training d
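
The data-agnostic nature of the penalty gradients in the theorem can be illustrated with a minimal sketch. It assumes the common definitions $\ell_1=\sum_i|\theta_i|$ and $\ell_2=\frac{1}{2}\sum_i\theta_i^2$, which are not spelled out in the text above; the function names are hypothetical. Under these assumptions the $\ell_1$ subgradient is the sign of each parameter, which reduces to the all-ones vector $\vec{1}$ when every parameter is positive, and the $\ell_2$ gradient is $\theta$ itself. Neither depends on the training data.

```python
def l1_penalty_grad(theta):
    """Subgradient of sum(|t|): the sign of each parameter (0 at exactly 0)."""
    return [(t > 0) - (t < 0) for t in theta]


def l2_penalty_grad(theta):
    """Gradient of 0.5 * sum(t**2): the parameter vector itself."""
    return list(theta)


# Illustrative parameter vector (assumed values, all positive).
theta = [0.5, 2.0, 1.5]
print(l1_penalty_grad(theta))  # [1, 1, 1]
print(l2_penalty_grad(theta))  # [0.5, 2.0, 1.5]
```

Because both functions take only `theta` as input, the direction they push the model is fixed by the current parameters alone, which is the bias sketched in Figure 1(a).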

Figures (7)

  • Figure 1: Update directions in the uncertain region on the training loss landscape. The black point is the trained student model without generalization. The flat optimum is the uncertain region of generalization. (a) shows the data-agnostic bias of $\ell_1$ and $\ell_2$ penalization, and (b) shows an example of the different directions of teacher models of the sKD methods. ($\theta_i$: model parameters).
  • Figure 2: Decision boundary and confidence distribution of dropout at various strengths. $p$ is the dropout probability of turning off a node. Red and black background colors are predictions for each class. White tubes represent confidence. The lighter color is less confident.
  • Figure 3: Confidence change of LS over seen and unseen areas in the sample space with respect to the interpolation rate. $n$ LS: LS with interpolation rate $n$ and a uniform distribution.
  • Figure 4: Process of the proposed method for generalization (y: ground truth; x: input sample; $\alpha$: KD hyperparameter controlling the transfer strength; Teacher: a neural network with the same architecture as the Student).
  • Figure 5: Decision boundary and confidence distribution of simple generalization (left three columns) and sKD-based generalization (right two columns) in the toy binary classification problem according to generalization strength. The red and blue dots are samples, and the background shows the prediction at each location. Training samples are inside the black square and test samples are outside ($\lambda$: penalty scale; $p$: dropout probability; $\alpha$: interpolation rate of sKD in Equation (\ref{eq:skd}); $\epsilon$: interpolation rate of LS in Equation (\ref{eq:ls})).
  • ...and 2 more figures
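
The two interpolation rates named in the captions above can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: Equations `eq:ls` and `eq:skd` are not reproduced on this page, so the sketch assumes the standard uniform label smoothing target $(1-\epsilon)\,y + \epsilon/K$ and a convex combination of the hard-label loss and the teacher-matching loss weighted by $\alpha$, with no temperature. All function names are hypothetical.

```python
import math


def smooth_labels(one_hot, eps):
    """Uniform label smoothing: mix the one-hot target with 1/K (assumed form)."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def kd_loss(one_hot, student_logits, teacher_logits, alpha):
    """Assumed KD objective: (1 - alpha) * CE(y, student) + alpha * CE(teacher, student)."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    hard = cross_entropy(one_hot, student)
    soft = cross_entropy(teacher, student)
    return (1.0 - alpha) * hard + alpha * soft


# Illustrative 4-class example (assumed values).
y = [0.0, 1.0, 0.0, 0.0]
student_logits = [0.2, 1.5, -0.3, 0.1]
teacher_logits = [0.0, 0.0, 0.0, 0.0]  # a teacher that predicts uniformly
print(smooth_labels(y, 0.2))
print(kd_loss(y, student_logits, teacher_logits, alpha=0.2))
```

Under this assumed form, a teacher with uniform predictions makes the KD loss coincide with cross-entropy against the smoothed labels at $\epsilon=\alpha$, since cross-entropy is linear in its target. A non-uniform teacher, such as a random or partially trained network, would instead pull the student toward its own confidence distribution.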

Theorems & Definitions (1)

  • Theorem 1