Cooperative Knowledge Distillation: A Learner Agnostic Approach

Michael Livanos, Ian Davidson, Stephen Wong

TL;DR

The paper addresses the limitations of traditional, single-teacher knowledge distillation by proposing cooperative distillation, a learner-agnostic framework in which multiple models act as both teachers and students. It transfers only the relevant knowledge by generating quintessential counterfactuals targeted at each model's deficiencies, enabling cross-architecture and cross-feature-space distillation without sharing raw data. The method is formalized through the sets $S_i$ (the instances model $i$ predicts correctly) and $R_{i \rightarrow j} = S_i - S_j$ (the instances on which model $i$ can teach model $j$), with counterfactual generation via $y' = f_i(x) + \alpha(y - f_i(x))$ with $\alpha = 0.5$, achieving strong improvements across diverse experiments and settings. This approach is particularly impactful for privacy-sensitive domains, offering scalable, model-agnostic distillation when data sharing is restricted yet model exchange is feasible.

Abstract

Knowledge distillation is a simple but powerful way to transfer knowledge from a teacher model to a student model. Existing work suffers from at least one of the following key limitations in the direction and scope of transfer, which restrict its use: all knowledge is transferred from teacher to student regardless of whether that knowledge is useful, the student is the only one learning in the exchange, and distillation typically transfers knowledge only from a single teacher to a single student. We formulate a novel form of knowledge distillation, which we call cooperative distillation, in which many models can act as both students and teachers. The models cooperate as follows: a model (the student) identifies specific deficiencies in its performance and searches for another model (the teacher) that encodes learned knowledge into instructional virtual instances via counterfactual instance generation. Because different models may have different strengths and weaknesses, all models can act as either students or teachers (cooperation) when appropriate and only distill knowledge in areas specific to their strengths (focus). Since counterfactuals as a paradigm are not tied to any specific algorithm, we can use this method to distill knowledge between learners of different architectures, algorithms, and even feature spaces. We demonstrate that our approach not only outperforms baselines such as transfer learning, self-supervised learning, and multiple knowledge distillation algorithms on several datasets, but can also be used in settings where the aforementioned techniques cannot.

Paper Structure (9 sections, 2 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Pipeline for our method. Each model and its corresponding dataset are color coded. Given $k + 1$ models $f_0$ through $f_k$ and datasets $X_0$ through $X_k$, we find which instances among the training sets each model is able to correctly predict (rows colored in green), creating a set of indices $S_i$ for each model (step 1). For every ordered pair of these sets ($S_i$, $S_j$), we find $R_{i \rightarrow j} = S_i - S_j$: the instances for which model $i$ is a teacher for model $j$ (rows highlighted in yellow, step 2). We then create counterfactuals using the appropriate teacher models and instances, labeled $X'_{teacher \rightarrow student}$ (step 3), shuffle the new, augmented instances into the training sets of each group, and retrain the models to create augmented models $f'_0$ to $f'_k$ (step 4).
  • Figure 2: Non-Data Sharing Scenario: Our procedure for deploying the method while maintaining data privacy. Institutions may share models, but not data. Models are exchanged (step 1), our technique is applied to create a subset of the virtual instances (step 2), and those virtual instances are shared and the models are retrained (step 3).
  • Figure 3: Illustrative example of quintessential counterfactual generation. Each model's decision surface is a contour map with each circle representing 20% confidence in the correct class. The orange model (teacher) predicts instance $x$ with 60% confidence, and the blue model (student) mispredicts $x$ as its confidence in the correct class is only 40%. The teacher model creates a virtual instance $x'$ it believes is more typical of the class.
  • Figure 4: Heatmap of the knowledge distilled into the different models of Experiment 4. Columns are features, and the rows represent the average change (yellow is positive, blue is negative) needed to move an instance to the diseased class. For example, the bottom-left tile shows that an increase in age (first column) is associated with heart disease, whilst the middle of the top row indicates that a reduction in ST-Depression (ninth column) decreases heart disease.
  • Figure 5: Heatmap visualizing the number of counterfactuals distilled to each model. Rows are students and columns are teachers, with yellow implying more instances and blue fewer.
  • ...and 2 more figures
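
Steps 1 and 2 of the Figure 1 pipeline can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes every model can be scored on one shared, consistently indexed pool of labeled instances, and the helper names `correct_indices` and `teaching_sets`, the toy data, and the threshold classifiers are all hypothetical. Counterfactual generation (step 3) and retraining (step 4) are not shown.

```python
from itertools import permutations


def correct_indices(predict, X, y):
    """Step 1: the set S_i of indices the model predicts correctly."""
    return {idx for idx, (x, label) in enumerate(zip(X, y)) if predict(x) == label}


def teaching_sets(S):
    """Step 2: R[(i, j)] = S[i] - S[j] for every ordered pair with i != j.

    R[(i, j)] holds the instances model i gets right and model j gets wrong,
    i.e. the instances on which model i can act as a teacher for model j.
    """
    return {(i, j): S[i] - S[j] for i, j in permutations(range(len(S)), 2)}


# Toy pool (assumption): one numeric feature, binary labels.
X = [0.1, 0.4, 0.6, 0.9]
y = [0, 0, 1, 1]

# Hypothetical models: simple threshold classifiers standing in for f_0..f_2.
models = [
    lambda x: int(x > 0.5),  # f_0: correct on every instance
    lambda x: int(x > 0.3),  # f_1: wrong on x = 0.4
    lambda x: int(x > 0.7),  # f_2: wrong on x = 0.6
]

S = [correct_indices(f, X, y) for f in models]
R = teaching_sets(S)

for (teacher, student), idxs in sorted(R.items()):
    print(f"R_{teacher}->{student} = {sorted(idxs)}")
```

In this toy setting $f_1$ and $f_2$ each teach the other on a different instance, which is the cooperative behavior the caption describes, while $f_0$ has nothing to learn and its incoming sets are empty.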