Understanding the Gains from Repeated Self-Distillation

Divyansh Pareek, Simon S. Du, Sewoong Oh

TL;DR

This analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension.

Abstract

Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
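The setup in the abstract can be illustrated with a minimal NumPy sketch. This excerpt does not state the distillation objective, so the scheme below is an assumption: each student is a ridge regression fit to targets that mix the previous model's predictions (weight `xi`) with the original labels (weight `1 - xi`), using one `xi` per step. The function names, the rows-as-samples layout of `X`, and the synthetic data are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def ridge(X, y, lam):
    """Closed-form ridge: argmin 0.5*||y - X @ theta||^2 + 0.5*lam*||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def repeated_self_distillation(X, y, lam, xis):
    """Assumed scheme (not taken from the source): every student is a ridge fit
    to targets mixing the previous model's predictions (weight xi) with the
    labels (weight 1 - xi). An empty `xis` returns the plain ridge teacher."""
    theta = ridge(X, y, lam)  # step 0: ridge teacher
    for xi in xis:
        targets = xi * (X @ theta) + (1.0 - xi) * y
        theta = ridge(X, targets, lam)  # same architecture, same data
    return theta


if __name__ == "__main__":
    # Hypothetical synthetic data, only to exercise the functions.
    rng = np.random.default_rng(0)
    n, d = 50, 10
    X = rng.normal(size=(n, d))
    theta_star = rng.normal(size=d)
    y = X @ theta_star + 0.1 * rng.normal(size=n)

    teacher = repeated_self_distillation(X, y, lam=1.0, xis=[])
    student = repeated_self_distillation(X, y, lam=1.0, xis=[0.5, -0.3])
    print("teacher error:", np.linalg.norm(teacher - theta_star))
    print("student error:", np.linalg.norm(student - theta_star))
```

Under this assumed objective, setting `xi = 0` at every step reproduces the ridge teacher exactly, so any gain has to come from choosing the `xi` values (and `lam`) well. The sketch leaves `xi` unconstrained so that values outside $[0, 1]$ can be tried.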

Paper Structure

This paper contains 37 sections, 11 theorems, 73 equations, 10 figures, and 1 table.

Key Result

Theorem 1

Under the fixed design linear regression in Assumption \ref{assump-LR}, there exists a family of problem instances satisfying Assumption \ref{assump-FD-sj} such that for any instance $(\mathbf{X}, \theta^{\star}, \gamma^2)$ in the family, it holds that where $r:={\rm rank}(\mathbf{X})$, $n$ is the number of samples, $\hat{\theta}(\lambda,\xi^{(r)})$ and $\hat{\theta}(\lambda,\xi)$ are the $r$-step and $1$-st

Figures (10)

  • Figure 1: The standard $1$-step self-distillation defined in Eq. \ref{onestep-dist-obj} with parameter $\xi$, and $k$-step self-distillation that repeatedly applies Eq. \ref{onestep-dist-obj} with parameter $\xi^{(k)}=[\xi^{(k)}_1,\xi^{(k)}_2,\ldots,\xi^{(k)}_k] \in \mathbb{R}^k$.
  • Figure 2: On a synthetic problem family with dimension $d=100$, noise variance $\gamma=0.1$, and $\theta^{\star}=\mathbf{u}_1$ (in agreement with Asmp. \ref{assump-FD-sj}.\ref{assump-FD-sj2}), we set the singular values of $\mathbf{X}$ with a power law from $s_1=1$ to $s_r=\{0.8, 0.5\}$ (left and right panels) and vary $r={\rm rank}(\mathbf{X})$. Both plots show a linear increase of the relative gain of $r$-step self-distillation in excess risk, i.e. the ratio $A/B$ where $A:= \min_{\lambda > 0} {\rm ExcessRisk} \bigl( \hat{\theta}(\lambda) \bigr)$ and $B:= \min_{\lambda > 0, \xi^{(r)} \in \mathbb{R}^r} {\rm ExcessRisk} \bigl( \hat{\theta}(\lambda, \xi^{(r)}) \bigr)$. This demonstrates that $r$-step SD outperforms ridge by a factor of $\Omega(r)$, with the constant inside the $\Omega$ (i.e. the slope of the line) changing with the effective condition number $s_1/s_r$.
  • Figure 3: On a synthetic task (explained in Section \ref{sec-expts-synth}), $\mathbf{X}$ has rank $4$ with (a) $\theta^{\star}=\mathbf{u}_1$ and distinct $s_j$'s; (b) $s=[1,1,1/2,1/3]$; (c) $\theta^{\star}=0.5(\mathbf{u}_1+\mathbf{u}_2+\mathbf{u}_3+\mathbf{u}_4)$. Each additional step of SD with the optimal choice of $\xi^{(k)}$ reduces ${\rm ExcessRisk}( \hat{\theta}(\lambda, (\xi^{(k)})^\star) )$ for any choice of $\lambda$ on the $x$-axis. Panel (a) satisfies Asmp. \ref{assump-FD-sj}, and hence $4$-step SD is necessary to achieve the optimal excess risk. This is no longer true when Asmp. \ref{assump-FD-sj}.\ref{assump-FD-sj1} is violated (panel (b)) or Asmp. \ref{assump-FD-sj}.\ref{assump-FD-sj2} is violated (panel (c)). The excess risk achieved by $4$-step SD (i.e. the green lines) in panels (a) and (c) exactly matches the numerical value given by the RHS of eq. \ref{eq-thm-sdk-opt-ER-LowerBound}, i.e. the fundamental lower bound for any SD estimator. This is not the case in panel (b), which has the same lower bound from eq. \ref{eq-thm-sdk-opt-ER-LowerBound} as panel (a), because Asmp. \ref{assump-FD-sj}.\ref{assump-FD-sj1} is violated.
  • Figure 4: On the synthetic problem from Figure \ref{fig-synth_A}, we fix $\lambda = 0.125$ and set the singular values of $\mathbf{X}$ as $s_j = \{1 - (j-1) \epsilon\}$, i.e. consecutive values are separated by $\epsilon$. For $k$-step SD with $k=\{1, 2, 3\}$, we plot $(\xi^{(k)})^\star (\lambda)$ (i.e. the optimal values of the $\xi$ parameters) by varying $\epsilon \in \{0.2, 0.1, 0.05, 0.02, 0.01\}$. The magnitude of the $\xi^{(k)}_k$ values increases as the singular gap $\epsilon$ decreases, verifying Remark \ref{rem-ill-cond}.
  • Figure 5: Validation set MSE vs $\lambda$ for three estimators: Ridge, $1$-step SD and $2$-step SD.
  • ...and 5 more figures

Theorems & Definitions (22)

  • Theorem 1
  • Remark 4.1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Remark 4.2: Necessity of Assumption \ref{assump-FD-sj}.\ref{assump-FD-sj1}
  • Lemma 4.1
  • Theorem 5: Informal version of Theorem \ref{thm-quadraticRisk-formal} in Appendix \ref{sec-app-proof-thm-quadraticRisk}
  • Proposition B.1
  • Lemma D.1
  • ...and 12 more