Quantifying Knowledge Distillation Using Partial Information Decomposition

Pasan Dissanayake, Faisal Hamman, Barproda Halder, Ilia Sucholutsky, Qiuyi Zhang, Sanghamitra Dutta

TL;DR

This work addresses fundamental limits of knowledge distillation by applying Partial Information Decomposition to separate task-relevant knowledge from nuisance information encoded in teacher representations. It defines the knowledge to distill as $Uni(Y:T\setminus S)$ and the transferred knowledge as $Red(Y:T,S)$, and introduces the RID framework that optimizes a lower bound on transferred knowledge via $Red_\cap (Y: T,S)$. The authors show that approaches maximizing $I(T;S)$ can misallocate capacity for limited students, and demonstrate RID's robustness to nuisance teachers through theory and two-phase optimization, with extensive empirical validation on CIFAR-10/100 and an ImageNet-to-CUB transfer scenario. Overall, the paper provides a rigorous information-theoretic lens and a practical algorithm for task-focused distillation, with implications for more resilient model compression and transfer learning.

Abstract

Knowledge distillation deploys complex machine learning models in resource-constrained environments by training a smaller student model to emulate internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.

Paper Structure

This paper contains 20 sections, 8 theorems, 28 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $T=(Z, G)$ where $Z$ contains all the task-related information (i.e., $I(Y;T)=I(Y;Z)$) and $G$ does not contain any information about the task (i.e., $I(Y;G)=0$). Let the student be a capacity-limited model as defined by $H(S) \leq \max\{H(Z), H(G)\}$ where $H(X)$ denotes the Shannon entropy ...
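The setting of this theorem can be illustrated on a toy discrete distribution. The sketch below is not the paper's estimator: it assumes $Z$ and $G$ are independent fair bits with $Y=Z$, and it swaps in the simple minimum-mutual-information proxy $\min\{I(Y;T), I(Y;S)\}$ for redundancy instead of the unique-information-based definition the paper uses. The function names (`mutual_information`, `build_joint`, `pid_atoms_mmi`) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from math import log2

# Coordinate positions inside each outcome tuple (y, z, g, s).
Y, Z, G, S = 0, 1, 2, 3
T = (Z, G)  # the teacher representation is the pair (Z, G)


def mutual_information(joint, a_idx, b_idx):
    """Return I(A;B) in bits for a joint pmf {outcome tuple: probability}."""
    p_ab, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        p_ab[(a, b)] += p
        p_a[a] += p
        p_b[b] += p
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_ab.items() if p > 0)


def build_joint(student):
    """Toy world (an assumption): Z, G independent fair bits, Y = Z, S = student(z, g)."""
    joint = defaultdict(float)
    for z in (0, 1):
        for g in (0, 1):
            joint[(z, z, g, student(z, g))] += 0.25
    return dict(joint)


def pid_atoms_mmi(joint):
    """PID atoms of I(Y;T,S) using the min-MI redundancy proxy (NOT the paper's measure).

    Uses the standard identities
      I(Y;T)   = Red + Uni(Y:T\\S)
      I(Y;S)   = Red + Uni(Y:S\\T)
      I(Y;T,S) = Red + Uni(Y:T\\S) + Uni(Y:S\\T) + Syn
    """
    i_t = mutual_information(joint, (Y,), T)
    i_s = mutual_information(joint, (Y,), (S,))
    i_ts = mutual_information(joint, (Y,), T + (S,))
    red = min(i_t, i_s)
    uni_t = i_t - red
    uni_s = i_s - red
    syn = i_ts - red - uni_t - uni_s
    return {"red": red, "uni_t": uni_t, "uni_s": uni_s, "syn": syn}


if __name__ == "__main__":
    students = {"copies G (nuisance)": lambda z, g: g,
                "copies Z (task)": lambda z, g: z}
    for name, student in students.items():
        joint = build_joint(student)
        alignment = mutual_information(joint, T, (S,))  # I(T;S)
        print(name, "I(T;S) =", alignment, pid_atoms_mmi(joint))
```

In this toy world both one-bit students share exactly one bit with the teacher, so $I(T;S)$ cannot tell them apart. Under the proxy, the student that copies $G$ has zero redundant information about $Y$ and leaves the full bit as $Uni(Y:T\setminus S)$, while the student that copies $Z$ has one bit of redundancy and nothing left to distill.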

Figures (5)

  • Figure 1: Knowledge Distillation: The teacher (a complex model) assists the student (usually a substantially simpler model) during their training. The learned student can perform much better than an independently trained student without distillation with a similar training setup (i.e., hyper-parameters and data). The teacher may or may not have been trained for the same task as the student.
  • Figure 2: Partial Information Decomposition: The box denotes the total joint mutual information $I(Y;T,S)$, which is decomposed into four non-negative terms named synergistic information $Syn(Y:T,S)$, redundant information $Red(Y:T,S)$ and the two unique information terms $Uni(Y:T \setminus S)$ and $Uni(Y:S \setminus T)$.
  • Figure 3: Redundant Information Distillation (RID) framework: $f_t(\cdot)$ and $f_s(\cdot)$ are the teacher and the student filter outputs, respectively. A classification head $g_t(\cdot)$ is appended to the teacher's filter during the first phase. Highlighted in amber are the components that are being updated in each phase.
  • Figure 4: Classification accuracy for CIFAR-10 dataset of RID, VID, TED and BAS when distilled using a trained (abbreviated "tt"--solid lines) and an untrained (abbreviated "ut"--dashed lines) teacher: The solid and dashed lines indicate the mean over three runs. Shaded areas represent the corresponding confidence regions: mean $\pm$ standard deviation. Colors correspond to the distillation method used.
  • Figure 5: Information atoms of $I(Y;T,S)$ for BAS, VID and RID when distilled using a trained and an untrained teacher: Values are shown for the innermost distilled layer. The first two rows show that when distilled from a trained teacher, the remaining amount of knowledge available in the teacher for distillation $Uni(Y:T \setminus S)$ decreases, whereas the already transferred knowledge $Red(Y:T,S)$ increases. Observe from the third row how VID performs worse than both RID and BAS when the teacher is not trained.
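
For reference, the decomposition described in the Figure 2 caption can be written out explicitly. The first line restates the caption; the second and third lines are the standard PID consistency identities for the two marginal mutual informations, stated here as background rather than quoted from the paper.

```latex
\begin{align}
I(Y;T,S) &= Red(Y:T,S) + Uni(Y:T\setminus S) + Uni(Y:S\setminus T) + Syn(Y:T,S) \\
I(Y;T)   &= Red(Y:T,S) + Uni(Y:T\setminus S) \\
I(Y;S)   &= Red(Y:T,S) + Uni(Y:S\setminus T)
\end{align}
```

Read against the paper's definitions, the second identity splits the teacher's task-relevant information $I(Y;T)$ into the part already transferred, $Red(Y:T,S)$, and the part still left to distill, $Uni(Y:T\setminus S)$.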

Theorems & Definitions (17)

  • Definition 3.1: Knowledge to distill
  • Definition 3.2: Transferred knowledge
  • Definition 3.3: Unique and redundant information (Bertschinger et al.)
  • Theorem 3.1: Teacher with nuisance
  • Theorem 3.2: Properties
  • Definition 4.1: $I_\alpha$ measure (Griffith and Ho)
  • Theorem 4.1: Transferred knowledge lower bound
  • Remark
  • Theorem B.1: Teacher with nuisance
  • Proof
  • ...and 7 more