Rethinking Knowledge in Distillation: An In-context Sample Retrieval Perspective

Jinjing Zhu; Songze Li; Lin Wang

TL;DR

This work redefines knowledge in distillation by using in-context sample retrieval to capture relationships among samples. It introduces IC-KD, which uses a teacher-constructed memory bank to fetch in-context samples and adds two logit-space loss terms, PICD and NICD, to regularize the student alongside standard KD losses. The approach delivers strong, often state-of-the-art, performance gains across offline, online, and teacher-free KD settings on CIFAR-100, ImageNet, and semantic segmentation tasks, and offers insights into retrieval-based regularization and efficiency. The findings highlight the practical potential of in-context sample relationships for improving knowledge transfer to compact models, with feature-space regularization left as a clear avenue for future work.

Abstract

Conventional knowledge distillation (KD) approaches train the student model to produce outputs similar to the teacher model's for each sample. Unfortunately, the relationship among samples of the same class is often neglected. In this paper, we explore redefining the knowledge in distillation to capture the relationship between each sample and its corresponding in-context samples (a group of similar samples with the same or different classes), and perform KD from an in-context sample retrieval perspective. As KD is a type of learned label smoothing regularization (LSR), we first conduct a theoretical analysis showing that the teacher's knowledge from the in-context samples is a crucial contributor to regularizing the student's training on the corresponding samples. Building on this analysis, we propose a novel in-context knowledge distillation (IC-KD) framework that proves effective across diverse KD paradigms (offline, online, and teacher-free KD). First, we construct a feature memory bank from the teacher model and retrieve in-context samples for each sample through retrieval-based learning. We then introduce Positive In-Context Distillation (PICD) to reduce, in the logit space, the discrepancy between a sample from the student and the aggregated in-context samples of the same class from the teacher. Moreover, Negative In-Context Distillation (NICD) separates, in the logit space, a sample from the student from the in-context samples of different classes from the teacher. Extensive experiments demonstrate that IC-KD is effective across various types of KD and consistently achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets.
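To make the two loss terms named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract only says that PICD reduces the discrepancy to the aggregated same-class in-context samples and that NICD separates the student from different-class in-context samples. The specific choices below (mean aggregation of teacher probabilities, KL divergence for PICD, mean cosine similarity as the NICD penalty, a shared temperature `tau`) and all function names are assumptions made for illustration.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def picd_loss(student_logits, pos_teacher_logits, tau=4.0):
    """Assumed PICD form: KL between the mean teacher distribution over
    same-class in-context samples and the student's distribution."""
    probs = [softmax(z, tau) for z in pos_teacher_logits]
    aggregated = [sum(col) / len(probs) for col in zip(*probs)]
    return kl_div(aggregated, softmax(student_logits, tau))

def nicd_loss(student_logits, neg_teacher_logits, tau=4.0):
    """Assumed NICD form: mean cosine similarity between the student's
    distribution and each different-class teacher distribution.
    Minimizing it pushes the student away from the negatives."""
    p_student = softmax(student_logits, tau)
    sims = [cosine(p_student, softmax(z, tau)) for z in neg_teacher_logits]
    return sum(sims) / len(sims)

# Toy logits for a 3-class problem (made-up values).
student = [2.0, 1.0, 0.0]
positives = [[2.2, 0.9, 0.1], [1.8, 1.1, 0.0]]   # teacher logits, same class
negatives = [[0.0, 1.0, 2.0]]                    # teacher logits, other class

loss_picd = picd_loss(student, positives)
loss_nicd = nicd_loss(student, negatives)
```

In a full training objective these terms would be weighted and added to the standard KD losses, as the TL;DR describes; the weights and the exact discrepancy and separation measures should be taken from the paper's Method section rather than from this sketch.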

Paper Structure (10 sections, 11 equations, 7 figures, 11 tables, 1 algorithm)

This paper contains 10 sections, 11 equations, 7 figures, 11 tables, 1 algorithm.

Introduction
Related Work
Method
Rethinking 'Knowledge' in Distillation
Distillation via In-Context Sample Retrieval
Total objective
Experiments
Settings
Results and Discussion
Conclusion and Limitations

Figures (7)

Figure 1: Illustration of four typical approaches: classic KD, relation-based KD, contrastive KD, and our IC-KD. Different shapes represent samples of different classes.
Figure 2: Comparative results of various teacher-student pairs on CIFAR-100. Note that IC-KD-all refers to the method of knowledge distillation by capturing the relationships across all samples of the same class within the logit space.
Figure 3: Overview of our framework. It consists of two feature extractors ($f^t$ and $f^s$) and two classifiers ($g^t$ and $g^s$). Given a sample $x_i$, we extract its student feature $h_i^s$ and obtain the prediction $p_i^s$ from the corresponding classifier $g^s$. We also build a feature memory bank $\{h_1^t, \ldots, h_j^t\}$ over the entire dataset with the teacher feature extractor $f^t$, and obtain the teacher feature $h_i^t$ of sample $x_i$. We then compute the similarity between $h_i^t$ and the features in the memory bank, and use the ground-truth labels to identify the Top-$\mathcal{K}$ and Top-$\mathcal{N}$ most similar features (in-context features) from the same class and from different classes, respectively. Finally, we regularize the student training with two proposed loss terms: $\mathcal{L}_{picd}$ and $\mathcal{L}_{nicd}$.
Figure 4: Effects of varying the values of $\beta_{1}$, $\mathcal{K}$, $\tau_{1}$, and $\beta_{2}$ on CIFAR-100 with ResNet32x4 $\to$ ResNet8x4.
Figure 5: Effects of varying the values of ${\gamma}_{picd}$ and ${\gamma}_{nicd}$ on CIFAR-100 with ResNet32x4 $\to$ ResNet8x4.
...and 2 more figures
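
The retrieval step described in the Figure 3 caption (compare a sample's teacher feature against the memory bank, then use ground-truth labels to pick the Top-K same-class and Top-N different-class neighbours) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the caption only says "similarity", so cosine similarity is an assumption, as are the function name `retrieve_in_context`, the `exclude` argument for skipping the query's own bank entry, and the toy data.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def retrieve_in_context(query_feat, query_label, bank_feats, bank_labels,
                        k, n, exclude=None):
    """Return (positive_indices, negative_indices) into the memory bank.

    positive_indices: up to k most similar entries with the query's label.
    negative_indices: up to n most similar entries with a different label.
    exclude: optional bank index to skip (e.g. the query itself).
    """
    scored = [
        (cosine(query_feat, feat), j)
        for j, feat in enumerate(bank_feats)
        if j != exclude
    ]
    # Most similar first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    positives = [j for _, j in scored if bank_labels[j] == query_label][:k]
    negatives = [j for _, j in scored if bank_labels[j] != query_label][:n]
    return positives, negatives

# Toy memory bank of 2-D teacher features with class labels (made-up values).
bank_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [0.8, 0.2]]
bank_labels = [0, 0, 1, 1, 0]

# Query with the teacher feature of sample 0, skipping its own entry.
pos, neg = retrieve_in_context(bank_feats[0], bank_labels[0],
                               bank_feats, bank_labels,
                               k=2, n=1, exclude=0)
print(pos, neg)  # [1, 4] [3]
```

A brute-force scan like this costs one similarity per bank entry per query; for a dataset-sized bank, a batched matrix product or an approximate nearest-neighbour index would be the usual substitute.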
