Linear Projections of Teacher Embeddings for Few-Class Distillation

Noel Loo, Fotis Iliopoulos, Wei Hu, Erik Vee

TL;DR

Learning Embedding Linear Projections (LELP) is a novel method for distilling knowledge from a teacher model's representations. Inspired by recent findings about the structure of final-layer representations, LELP identifies informative linear subspaces in the teacher's embedding space and splits them into pseudo-subclasses.

Abstract

Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher's output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher's internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model's generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher model's representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher's embedding space and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-subclasses. Our experimental evaluation on large-scale NLP benchmarks such as Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.
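
The scaling claim can be made concrete with a simple count. A teacher's probability vector over $C$ classes sums to one, so it carries at most $C - 1$ free values per example. The comparison below is an illustrative counting argument added here, not a result from the paper. The symbol $S$ (number of pseudo-subclasses per class) follows the notation in the figure captions, and the value $S = 4$ is a hypothetical choice.

$$
\underbrace{C - 1}_{\text{class probabilities only}}
\quad \longrightarrow \quad
\underbrace{C \cdot S - 1}_{\text{with } S \text{ pseudo-subclasses per class}}
$$

For binary classification ($C = 2$), class probabilities alone pass a single scalar per example to the student. With $S = 4$ pseudo-subclasses per class, the target distribution has $2 \cdot 4 - 1 = 7$ free values.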

Paper Structure (29 sections, 7 equations, 6 figures, 13 tables, 2 algorithms)

Figures (6)

  • Figure 1: Learning with Embedding Projections (LELP) teaches students about subclass structure (shown by $v_{c,i}$ in (b)) via linear projections. As shown in (a), LELP outperforms existing algorithms on large-scale real-world NLP datasets such as Amazon Reviews (5 classes, 500k examples) and Sentiment140 (binary, 1.6 million examples), achieving improvements of 1.85% and 0.88%, respectively, over the best baseline. In the former case, the LELP-trained student outperforms even the teacher, which has over 20x as many parameters.
  • Figure 2: Schematic of the Learning with Embedding Projections (LELP) algorithm. LELP decomposes Teacher predictions into subclasses via a PCA decomposition, and trains a student on these subclasses. For predictions, subclasses are summed together back into their original classes.
  • Figure 3: Effectiveness of different clustering techniques for creating pseudo-subclasses during knowledge distillation from a ResNet-92 teacher to (a) a ResNet-56 student and (b) a MobileNet student on the binarized CIFAR-100 dataset. "Oracle Clustering", where subclass structure is known a priori, serves as an upper bound and notably surpasses all other methods, even exceeding the teacher's performance in the ResNet-56 case. Among practical methods (i.e., those that discover subclass structure), LELP performs best. Agglomerative and K-means clustering do not consistently outperform vanilla knowledge distillation, which shows that the effectiveness of pseudo-subclasses depends on the chosen clustering algorithm.
  • Figure 5: Top row: Ablations of the choice of projection, number of subclasses $S$, and subclass temperature $\beta$ on CIFAR-100bin, shown as raw CIFAR-100bin accuracy. Bottom row: Accuracy gain over random projections (using the same hyperparameter choice). Values above 0 indicate an advantage over random projections, which we see consistently with LELP.
  • Figure 6: Experiments on the binary and standard CIFAR-100 datasets using ResNet92 as the teacher and ResNet56 and MobileNet, respectively, as the student. Left: Distillation Dataset Size vs. Accuracy on binary CIFAR-100. LELP achieves the same performance as standard training while using only $25\%$ of the data. Middle: Student's validation accuracy over the training trajectory. LELP offers both performance gains over Vanilla KD and a faster convergence rate. Right: Illustration of the performance of LELP in a semi-supervised setting. The x-axis shows the initial quantity of labeled examples used to train the teacher model, which then generates pseudo-labels for the remaining (unlabeled) portion of the CIFAR-100 dataset. See Appendix \ref{app:semi} for more details.
  • ...and 1 more figures
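
The pipeline described in the Figure 2 caption (PCA decomposition into subclasses, training on subclasses, summing subclasses back into classes) can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, examples are grouped by the teacher's argmax class, raw embeddings are projected without centering, and the within-class split is assumed to be a softmax over projections scaled by the subclass temperature `beta`. Random arrays stand in for real teacher embeddings and probabilities.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def pca_directions(embeddings, num_subclasses):
    """Return the top principal directions of `embeddings`, shape (S, D)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_subclasses]


def subclass_targets(embeddings, class_probs, directions, beta=1.0):
    """Split each teacher class probability across S pseudo-subclasses.

    embeddings:  (N, D) teacher embeddings
    class_probs: (N, C) teacher class probabilities
    directions:  (C, S, D) projection directions per class
    returns:     (N, C, S) targets whose sum over S equals class_probs
    """
    # Project every embedding onto every class's directions.
    proj = np.einsum("nd,csd->ncs", embeddings, directions)
    # Assumption: within-class weights are a temperature-scaled softmax.
    within = softmax(proj / beta, axis=2)
    return class_probs[:, :, None] * within


def collapse_to_classes(subclass_probs):
    """Sum pseudo-subclass probabilities back into original class probabilities."""
    return subclass_probs.sum(axis=2)


# Synthetic stand-ins for teacher outputs (hypothetical sizes).
rng = np.random.default_rng(0)
N, D, C, S = 200, 16, 2, 3
emb = rng.normal(size=(N, D))
class_probs = softmax(rng.normal(size=(N, C)))

# Assumption: group examples by the teacher's predicted class, then run PCA per class.
labels = class_probs.argmax(axis=1)
directions = np.stack([pca_directions(emb[labels == c], S) for c in range(C)])

targets = subclass_targets(emb, class_probs, directions, beta=1.0)
recovered = collapse_to_classes(targets)
```

In this sketch a student would output `C * S` values and be trained against `targets`. At prediction time its subclass probabilities would be summed per class with `collapse_to_classes`, mirroring the last step in the caption.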