Multi-perspective Contrastive Logit Distillation
Qi Wang, Jinjia Zhou
TL;DR
This paper revisits knowledge distillation by reframing logit distillation around the semantic properties of raw logits and introduces Multi-perspective Contrastive Logit Distillation (MCLD). MCLD combines three contrastive perspectives (Instance-wise, Sample-wise, and Category-wise CLD) with a Target Mask. It enables 1-to-N comparisons without extra feature-mapping modules and uses a warm-up schedule to emphasize category-level signals later in training. Empirically, MCLD achieves state-of-the-art performance on CIFAR-100 and ImageNet, often surpassing both logit- and feature-distillation baselines while exhibiting superior training efficiency, and extends effectively to Vision Transformers and transfer tasks. The work highlights the untapped potential of logits for knowledge transfer and provides a practical, scalable framework for robust distillation across diverse architectures and datasets.
Abstract
In previous studies on knowledge distillation, the significance of logit distillation has frequently been overlooked. To revitalize logit distillation, we present a novel perspective by reconsidering its computation based on the semantic properties of logits and exploring how to utilize it more efficiently. Logits often contain a substantial amount of high-level semantic information; however, the conventional approach of using logits to compute the Kullback-Leibler (KL) divergence does not account for their semantic properties, and this direct KL divergence computation fails to fully exploit the potential of logits. To address these challenges, we introduce a novel and efficient logit distillation method, Multi-perspective Contrastive Logit Distillation (MCLD), which substantially improves the performance and efficacy of logit distillation. Compared with existing logit distillation methods and complex feature distillation methods, MCLD attains state-of-the-art performance in image classification and transfer learning tasks across multiple datasets, including CIFAR-100, ImageNet, Tiny-ImageNet, and STL-10. Additionally, MCLD exhibits superior training efficiency and outstanding performance when distilling Vision Transformers, further emphasizing its notable advantages. This study unveils the vast potential of logits in knowledge distillation and seeks to offer valuable insights for future research.
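For readers unfamiliar with the conventional approach the abstract refers to, the following is a minimal sketch of the standard temperature-scaled KL divergence loss used in classic logit distillation. It is not MCLD and is not taken from the paper; the function names `log_softmax` and `kd_kl_loss`, the default temperature of 4.0, and the T^2 scaling are illustrative assumptions based on the common formulation of this baseline.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]

def kd_kl_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) between softened class distributions.

    Assumption: the result is multiplied by T^2, as is common practice
    for this baseline, so gradient magnitudes stay comparable across T.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

# Hypothetical logits for a single 3-class sample.
teacher = [2.0, 1.0, 0.1]
student = [0.1, 1.0, 2.0]
loss = kd_kl_loss(teacher, student)
```

Note that this baseline compares only the softened class probabilities of one sample at a time, and because softmax is shift-invariant, adding a constant to every student logit leaves the loss unchanged.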
