A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks
Saptarshi Mandal, Xiaojun Lin, R. Srikant
TL;DR
The paper addresses why soft-label distillation enables markedly smaller networks than hard-label training. It develops a theoretical analysis of a two-layer ReLU student trained with soft labels drawn from the kernel limit of an infinite-width teacher, using RKHS and neural tangent kernel concepts under projected gradient descent. The authors prove neuron-count bounds showing soft-label training requires about $O(1/(\gamma^2 \epsilon))$ neurons to drive the average KL loss below $\epsilon$, while hard-label training requires about $O(1/\gamma^4)$ (up to log factors), with the gap widening as the separation margin $\gamma$ becomes small. They validate the theory with experiments on MNIST-derived binary tasks and CIFAR-10 with added noise, showing that soft-label distillation remains advantageous in deeper architectures and harder datasets. Overall, the work provides a principled explanation for the efficiency of knowledge distillation and informs practical model compression under limited capacity.
Abstract
Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of \citep{hinton2015distilling}. Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only $O\left(\frac{1}{\gamma^2 \epsilon}\right)$ neurons to achieve a classification loss averaged over epochs smaller than some $\epsilon > 0$, where $\gamma$ is the separation margin of the limiting kernel. In contrast, hard-label training requires $O\left(\frac{1}{\gamma^4} \cdot \ln\left(\frac{1}{\epsilon}\right)\right)$ neurons, as derived from an adapted version of the gradient descent analysis in \citep{ji2020polylogarithmic}. This implies that when $\gamma \leq \epsilon$, i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
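
To make the comparison of the two bounds concrete, the following back-of-the-envelope calculation may help. It is only a sketch: it treats the two $O(\cdot)$ expressions as if their hidden constants were equal (an assumption, since the abstract does not state them), it assumes $0 < \epsilon < 1$ so that $\ln(1/\epsilon) > 0$, and the symbols $N_{\mathrm{soft}}$ and $N_{\mathrm{hard}}$ are illustrative names for the two neuron counts rather than notation taken from the paper. Both quantities are sufficient neuron counts, so the ratio compares guarantees, not necessarily the true minimal widths.
\[
\frac{N_{\mathrm{hard}}}{N_{\mathrm{soft}}}
\approx \frac{\gamma^{-4}\,\ln(1/\epsilon)}{\gamma^{-2}\,\epsilon^{-1}}
= \frac{\epsilon\,\ln(1/\epsilon)}{\gamma^{2}}
\;\geq\; \frac{\ln(1/\epsilon)}{\epsilon}
\qquad \text{whenever } \gamma \leq \epsilon .
\]
The inequality uses $\gamma \leq \epsilon \Rightarrow \gamma^2 \leq \epsilon^2$. As an illustration under the same assumptions, taking $\gamma = \epsilon = 0.01$ gives $N_{\mathrm{soft}} \approx 10^{6}$ and $N_{\mathrm{hard}} \approx 10^{8} \cdot \ln(100) \approx 4.6 \times 10^{8}$, a ratio of roughly $460$. The ratio grows further as $\gamma$ shrinks relative to $\epsilon$.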
