A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks

Saptarshi Mandal, Xiaojun Lin, R. Srikant

TL;DR

The paper addresses why soft-label distillation enables markedly smaller networks than hard-label training. It develops a theoretical analysis of a two-layer ReLU student trained with soft labels drawn from the kernel limit of an infinite-width teacher, using RKHS and neural tangent kernel concepts under projected gradient descent. The authors prove neuron-count bounds showing soft-label training requires about $O(1/(\gamma^2 \epsilon))$ neurons to drive the average KL loss below $\epsilon$, while hard-label training requires about $O(1/\gamma^4)$ (up to log factors), with the gap widening as the separation margin $\gamma$ becomes small. They validate the theory with experiments on MNIST-derived binary tasks and CIFAR-10 with added noise, showing that soft-label distillation remains advantageous in deeper architectures and harder datasets. Overall, the work provides a principled explanation for the efficiency of knowledge distillation and informs practical model compression under limited capacity.
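To make the two training signals concrete, here is a minimal sketch of the per-example losses for binary classification. It assumes the student outputs a scalar logit, that soft-label training minimizes the KL divergence between the teacher's probability and the student's sigmoid output, and that hard-label training uses the logistic loss on labels in {-1, +1}. The function names are illustrative and do not come from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(x):
    """Stable computation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def soft_label_kl(p_teacher, student_logit):
    """KL(teacher || student) for one example, with a Bernoulli teacher
    probability p_teacher in [0, 1] (assumed form of the soft-label loss)."""
    q = sigmoid(student_logit)
    kl = 0.0
    for p, q_side in ((p_teacher, q), (1.0 - p_teacher, 1.0 - q)):
        if p > 0.0:  # the 0 * log(0) term is taken to be 0
            kl += p * math.log(p / q_side)
    return kl

def hard_label_logistic(y, student_logit):
    """Logistic loss for one example with a hard label y in {-1, +1}."""
    return softplus(-y * student_logit)

if __name__ == "__main__":
    for logit in (-2.0, 0.0, 2.0):
        print(logit, soft_label_kl(0.8, logit), hard_label_logistic(1, logit))
```

Under these assumptions the soft-label loss reaches zero once the student matches the teacher's probability, whereas the logistic loss on a hard label keeps decreasing as the logit grows.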

Abstract

Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of \citep{hinton2015distilling}. Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only $O\left(\frac{1}{\gamma^2 \epsilon}\right)$ neurons to achieve a classification loss averaged over epochs smaller than some $\epsilon > 0$, where $\gamma$ is the separation margin of the limiting kernel. In contrast, hard-label training requires $O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)$ neurons, as derived from an adapted version of the gradient descent analysis in \citep{ji2020polylogarithmic}. This implies that when $\gamma \leq \epsilon$, i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
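As a rough numerical illustration of how the two bounds scale, the sketch below evaluates $\frac{1}{\gamma^2 \epsilon}$ and $\frac{1}{\gamma^4}\ln\frac{1}{\epsilon}$ with every hidden constant set to 1. That choice is an assumption made purely for illustration, since the $O(\cdot)$ notation does not fix the constants, so only the trend in $\gamma$ is meaningful and not the absolute neuron counts.

```python
import math

def soft_label_neurons(gamma, eps):
    """Soft-label scaling 1 / (gamma^2 * eps), constants assumed to be 1."""
    return 1.0 / (gamma ** 2 * eps)

def hard_label_neurons(gamma, eps):
    """Hard-label scaling ln(1 / eps) / gamma^4, constants assumed to be 1."""
    return math.log(1.0 / eps) / gamma ** 4

if __name__ == "__main__":
    eps = 0.1
    for gamma in (0.1, 0.05, 0.01):
        soft = soft_label_neurons(gamma, eps)
        hard = hard_label_neurons(gamma, eps)
        print(f"gamma={gamma}: soft={soft:.3g}, hard={hard:.3g}, "
              f"hard/soft={hard / soft:.3g}")
```

The ratio of the two expressions is $\epsilon \ln(1/\epsilon) / \gamma^2$, so for a fixed target $\epsilon$ the gap grows like $1/\gamma^2$ as the margin shrinks.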

Paper Structure

This paper contains 20 sections, 8 theorems, 86 equations, 1 figure, 1 table.

Key Result

theorem 2

Let $\beta \in (0,1)$ and $\delta \in \left(0,\frac{1}{3}\right)$ be fixed real numbers. If the number of neurons $m$ satisfies and the PGD algorithm is run with a projection radius $B = 1$ for $T \geq \frac{9}{\beta^2}$ iterations, using a constant step size $\eta \leq \frac{\beta}{3}$, then the following bound on the averaged empirical risk holds: with probability
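The iteration and step-size conditions in the theorem can be turned into concrete hyperparameters. The sketch below computes the smallest $T$ and the largest constant $\eta$ permitted by $T \geq 9/\beta^2$ and $\eta \leq \beta/3$, together with a projection step. The projection here is a Euclidean projection onto a ball of radius $B$ centred at the initial weights, which is an assumption: the excerpt states only the radius $B = 1$, not the exact constraint set. The helper names are hypothetical.

```python
import math

def pgd_schedule(beta):
    """Smallest iteration count T and largest constant step size eta
    allowed by T >= 9 / beta^2 and eta <= beta / 3."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    T = math.ceil(9.0 / beta ** 2)
    eta = beta / 3.0
    return T, eta

def project_to_ball(w, w0, radius=1.0):
    """Euclidean projection of the weight vector w onto the ball of the
    given radius centred at w0 (assumed form of the PGD constraint set)."""
    diff = [a - b for a, b in zip(w, w0)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(w)  # already feasible, nothing to do
    scale = radius / norm
    return [b + scale * d for b, d in zip(w0, diff)]

if __name__ == "__main__":
    T, eta = pgd_schedule(0.5)
    print("T =", T, "eta =", eta)
    print(project_to_ball([3.0, 4.0], [0.0, 0.0], radius=1.0))
```

In a full PGD loop, each gradient step with step size `eta` would be followed by one call to `project_to_ball`, repeated for `T` iterations.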

Figures (1)

  • Figure 1: Classification accuracy under Gaussian noise on CIFAR-10 cat/dog with VGG 8+3 as the teacher and VGG 2+3 as the student.

Theorems & Definitions (11)

  • theorem 2
  • lemma 1
  • corollary 1
  • proposition 1
  • lemma 2
  • lemma 3
  • proof
  • lemma 4
  • proof
  • lemma 5
  • ...and 1 more