Local Dense Logit Relations for Enhanced Knowledge Distillation

Liuchi Xu, Kang Liu, Jinshuai Liu, Lu Wang, Lisheng Xu, Jun Cheng

TL;DR

This paper tackles the challenge of transferring fine-grained inter-class knowledge in knowledge distillation by introducing Local Dense Relational Logit Distillation (LDRLD). LDRLD recursively decouples and recombines logit information to build dense inter-class relationships, and augments this with an Adaptive Decay Weight (ADW) strategy that combines Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD) to emphasize the most critical category pairs. It also distills the remaining non-target knowledge to ensure completeness, and combines these components into a single total loss. The paper reports improved student performance on CIFAR-100, Tiny-ImageNet, and ImageNet-1K, with further demonstrations on object detection and fine-grained tasks. It reports consistent improvements over state-of-the-art logit-based KD methods, and supports them with visualization analyses showing closer teacher-student logit alignment and more targeted attention.

Abstract

State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize the performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student's performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based distillation approaches. The code will be made publicly available.
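The two ADW ingredients described above can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the function names `rank_classes` and `adw_weights`, the use of 1-based teacher ranks, the decay rate `alpha`, the form `1 / (r_j - r_i)` for IRW, the form `exp(-alpha * (r_i + r_j))` for ERD, and the choice to multiply the two factors.

```python
import math
from itertools import combinations


def rank_classes(teacher_logits, depth):
    """Return the indices of the top-`depth` classes, highest logit first."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda k: teacher_logits[k], reverse=True)
    return order[:depth]


def adw_weights(depth, alpha=0.5):
    """Map each rank pair (r_i, r_j), 1-based with r_i < r_j, to a weight.

    Assumed forms (not taken from the paper):
      IRW: inversely proportional to the rank difference r_j - r_i.
      ERD: exponential decay in the total rank score r_i + r_j.
    """
    weights = {}
    for r_i, r_j in combinations(range(1, depth + 1), 2):
        irw = 1.0 / (r_j - r_i)               # closer ranks get a larger weight
        erd = math.exp(-alpha * (r_i + r_j))  # low-ranked pairs decay faster
        weights[(r_i, r_j)] = irw * erd       # assumed combination: product
    return weights


# Hypothetical teacher logits for a 4-class problem.
teacher_logits = [2.0, 5.0, 0.1, 3.5]
top = rank_classes(teacher_logits, depth=3)
weights = adw_weights(depth=3)

for (r_i, r_j), w in sorted(weights.items()):
    print(f"classes ({top[r_i - 1]}, {top[r_j - 1]}) ranks ({r_i}, {r_j}) weight {w:.4f}")
```

Under these assumed forms, the pair formed by the two highest-ranked classes receives the largest weight, which matches the stated goal of emphasizing the most critical relationships.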

Paper Structure

This paper contains 34 sections, 19 equations, 5 figures, 23 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Hinton et al. [hinton2015distilling] introduce KD with a global softmax, which computes probabilities over all $C$ classes, leading to information redundancy between classes and diminishing logit discrimination. In classical KD, the difference in prediction probability between "cat" and "dog", derived from the teacher's logit output, is $\Delta P_{KD}=|p_{1}^{t}-p_{2}^{t}|=\left|\frac{\exp({\mathbf{Z_{1}^{t}}})-\exp({\mathbf{Z_{2}^{t}}})}{\sum_{i=1}^C{\exp({\mathbf{Z_{i}^{t}}})}}\right|$. (b) In contrast, the proposed LDRLD normalizes over the category pair only and calculates the probability difference between "cat" and "dog" as $\Delta P_{LDRLD}=|p_{1}^{t}-p_{2}^{t}|=\left|\frac{\exp({\mathbf{Z_{1}^{t}}})-\exp({\mathbf{Z_{2}^{t}}})}{\sum_{i=1}^2{\exp({\mathbf{Z_{i}^{t}}})}}\right|$. The numerators are identical and the pairwise denominator omits the remaining $C-2$ positive terms, so $\Delta P_{LDRLD}>\Delta P_{KD}$ whenever $C>2$ and the two logits differ. This indicates that the approach enhances inter-class differences compared to KD and improves fine-grained logit discrimination.
  • Figure 2: Overview of the proposed LDRLD framework, which includes three key loss functions: $\mathcal{L}^{w}$, $\mathcal{L}_{RNTK}$, and $\mathcal{L}_{LLKI}$.
  • Figure 3: Impact of the depth $d$ on the performance of the student on CIFAR-100.
  • Figure 4: Feature map visualization of the student's penultimate layers on CIFAR-100 dataset using vanilla KD and LDRLD.
  • Figure 5: Visualization of the difference in correlation matrices between student and teacher logits for different teacher-student pairs: ResNet32$\times$4 vs ResNet8$\times$4, and ResNet32$\times$4 vs ShuffleNetV2, on the CIFAR-100 dataset.
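The comparison in the Figure 1 caption can be checked numerically. The sketch below evaluates both expressions from the caption on a made-up teacher logit vector; the logit values and the helper names `softmax`, `delta_p_global`, and `delta_p_pair` are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def delta_p_global(logits, i, j):
    """|p_i - p_j| with probabilities normalized over all C classes (classical KD)."""
    p = softmax(logits)
    return abs(p[i] - p[j])


def delta_p_pair(logits, i, j):
    """|p_i - p_j| with probabilities normalized over the pair (i, j) only."""
    p = softmax([logits[i], logits[j]])
    return abs(p[0] - p[1])


# Hypothetical teacher logits; index 0 plays "cat", index 1 plays "dog".
teacher_logits = [5.0, 3.0, 1.0, 0.5, -1.0]

dp_kd = delta_p_global(teacher_logits, 0, 1)
dp_ldrld = delta_p_pair(teacher_logits, 0, 1)

print(f"global softmax difference:   {dp_kd:.4f}")
print(f"pairwise softmax difference: {dp_ldrld:.4f}")
print("pairwise is larger:", dp_ldrld > dp_kd)
```

The pairwise difference depends only on the gap between the two logits, while the global difference shrinks as the other classes take up more probability mass. When the two logits are equal, both differences are zero.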