Quantifying Knowledge Distillation Using Partial Information Decomposition
Pasan Dissanayake, Faisal Hamman, Barproda Halder, Ilia Sucholutsky, Qiuyi Zhang, Sanghamitra Dutta
TL;DR
This work studies the fundamental limits of knowledge distillation, applying Partial Information Decomposition (PID) to separate task-relevant knowledge from nuisance information encoded in teacher representations. Writing $Y$ for the task, $T$ for the teacher representation, and $S$ for the student representation, it defines the knowledge left to distill as $Uni(Y:T\setminus S)$ and the transferred knowledge as $Red(Y:T,S)$. It then introduces the RID framework, which optimizes a lower bound on the transferred knowledge, $Red_\cap(Y:T,S)$. The authors show that approaches maximizing $I(T;S)$ can misallocate the capacity of a limited student, and they support RID's robustness to nuisance teachers with theory and a two-phase optimization procedure, validated empirically on CIFAR-10/100 and an ImageNet-to-CUB transfer scenario. Overall, the paper offers an information-theoretic lens and a practical algorithm for task-focused distillation, with implications for more resilient model compression and transfer learning.
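For readers new to PID, the standard consistency equations below may help place the two quantities named above. This is a sketch of the generic PID bookkeeping, not a statement of the paper's specific definitions: $Y$, $T$, and $S$ are assumed to denote the task, teacher representation, and student representation, and $Syn$ (synergistic information) is notation assumed here, since the summary names only $Uni$ and $Red$.

```latex
\begin{align}
I(Y;T,S) &= Uni(Y:T\setminus S) + Uni(Y:S\setminus T) + Red(Y:T,S) + Syn(Y:T,S), \\
I(Y;T)   &= Uni(Y:T\setminus S) + Red(Y:T,S), \\
I(Y;S)   &= Uni(Y:S\setminus T) + Red(Y:T,S).
\end{align}
```

Read this way, the teacher's task-relevant information $I(Y;T)$ splits into a part the student already shares, $Red(Y:T,S)$, and a part it does not yet have, $Uni(Y:T\setminus S)$.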
Abstract
Knowledge distillation enables deploying complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers because it quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
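The contrast between aligning representations and transferring task-relevant knowledge can be illustrated with a toy discrete example. The sketch below is not the RID algorithm: the setup (a fair-bit label, an independent fair-bit nuisance, a teacher that encodes both, a one-bit student) is hypothetical, and redundancy is approximated with the simple minimum-mutual-information proxy $\min(I(Y;T), I(Y;S))$ rather than the paper's $Red_\cap$ measure. All function names are illustrative.

```python
import itertools
import numpy as np


def mutual_information(joint):
    """Return I(A;B) in bits for a 2-D joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (n_a, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, n_b)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))


def joint_table(a_values, b_values, n_a, n_b):
    """Build a joint table from equally likely paired outcomes."""
    table = np.zeros((n_a, n_b))
    for a, b in zip(a_values, b_values):
        table[a, b] += 1
    return table / table.sum()


# Hypothetical toy world: label Y and nuisance N are independent fair bits.
outcomes = list(itertools.product([0, 1], [0, 1]))   # (y, n) pairs
label = [y for y, _ in outcomes]
teacher = [2 * y + n for y, n in outcomes]           # T encodes both Y and N
student_label = [y for y, _ in outcomes]             # one-bit student copies Y
student_nuisance = [n for _, n in outcomes]          # one-bit student copies N


def report(student):
    """Compare the alignment objective I(T;S) with a redundancy proxy."""
    i_ts = mutual_information(joint_table(teacher, student, 4, 2))
    i_yt = mutual_information(joint_table(label, teacher, 2, 4))
    i_ys = mutual_information(joint_table(label, student, 2, 2))
    red = min(i_yt, i_ys)        # assumption: MMI proxy, not the paper's Red_cap
    uni = i_yt - red             # task knowledge still left in the teacher
    return {"I(T;S)": i_ts, "Red": red, "Uni": uni}


if __name__ == "__main__":
    print("student copies label:   ", report(student_label))
    print("student copies nuisance:", report(student_nuisance))
```

In this toy setup both students reach the same value of $I(T;S)$, so an alignment objective cannot tell them apart, while the redundancy proxy is nonzero only for the student that kept the label.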
