Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems

Zhangchi Zhu, Wei Zhang

TL;DR

This paper analyzes Cross-Entropy loss in knowledge distillation (KD) for recommender systems and proposes Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD), which splits the top items given by the teacher into two subsets based on whether they are highly ranked by the student.

Abstract

This paper analyzes Cross-Entropy (CE) loss in knowledge distillation (KD) for recommender systems. KD for recommender systems aims to distill rankings, especially among the items most likely to be preferred, and its loss can only be computed on a small subset of items. Considering these features, we reveal the connection between CE loss and NDCG in the context of KD. We prove that when performing KD on an item subset, minimizing CE loss maximizes a lower bound of NDCG only if an assumption of closure is satisfied. This assumption requires that the item subset consist of the student's top items. However, this contradicts our goal of distilling the rankings of the teacher's top items. We empirically demonstrate the vast gap between these two kinds of top items. To bridge the gap between our goal and its theoretical support, we propose Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD). It splits the top items given by the teacher into two subsets based on whether they are highly ranked by the student. For the subset that defies the condition, a sampling strategy is devised that uses teacher-student collaboration to approximate our assumption of closure. We also combine the losses on the two subsets adaptively. Extensive experiments demonstrate the effectiveness of our method. Our code is available at https://github.com/BDML-lab/RCE-KD.
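
The splitting step described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the cutoffs `k_teacher` and `k_student`, and the toy scores are hypothetical, and the sketch covers only the partition of the teacher's top items, not the sampling strategy or the adaptive loss combination.

```python
def rank_positions(scores):
    """Map each item index to its 0-based rank (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {item: pos for pos, item in enumerate(order)}


def split_teacher_top_items(teacher_scores, student_scores, k_teacher, k_student):
    """Partition the teacher's top-k_teacher items by the student's ranking.

    Returns (agreed, missed):
      agreed: teacher top items that the student also ranks in its top-k_student
      missed: teacher top items that the student ranks lower than that
    Both cutoffs are illustrative assumptions, not values from the paper.
    """
    t_rank = rank_positions(teacher_scores)
    s_rank = rank_positions(student_scores)
    # Teacher's top items, listed in the teacher's ranking order.
    teacher_top = sorted((i for i in t_rank if t_rank[i] < k_teacher), key=t_rank.get)
    agreed = [i for i in teacher_top if s_rank[i] < k_student]
    missed = [i for i in teacher_top if s_rank[i] >= k_student]
    return agreed, missed


# Toy scores for one user over six items (made up for illustration).
teacher = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]
student = [0.8, 0.1, 0.6, 0.9, 0.2, 0.3]
agreed, missed = split_teacher_top_items(teacher, student, k_teacher=3, k_student=3)
print(agreed, missed)
```

In this toy case the teacher's top three items are 0, 1 and 2, while the student places item 1 last, so item 1 falls into the subset that violates the closure condition.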

Paper Structure

This paper contains 41 sections, 2 theorems, 24 equations, 10 figures, 15 tables.

Key Result

Theorem 4.1

Suppose that we compute CE loss on the entire item set $\mathcal{I}$ and take the teacher's predicted scores (i.e., $\boldsymbol r_u^T$) as the target. In that case, we maximize a lower bound of NDCG, with the teacher's transformed predictive scores $\boldsymbol{y}=\log_2(\sigma(\boldsymbol r_u^T)+1)$ ...
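
To make the two quantities in the theorem concrete, here is a small sketch under stated assumptions. The theorem statement above is truncated, so the sketch reads the relevance as $y=\log_2(\sigma(r)+1)$, which gives the usual NDCG gain $2^{y}-1=\sigma(r)$. The CE form used here (teacher sigmoid scores as weights on the student's log-softmax) is an assumed, commonly used variant and may differ from the paper's exact definition. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def ce_loss(teacher_scores, student_scores):
    """Assumed CE form: -sum_i sigmoid(r_T[i]) * log softmax(r_S)[i]."""
    log_q = log_softmax(student_scores)
    return -sum(sigmoid(t) * lq for t, lq in zip(teacher_scores, log_q))


def dcg(gains_in_rank_order):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains_in_rank_order))


def ndcg(teacher_scores, student_scores):
    """NDCG of the student's ranking, with gain 2**y - 1 = sigmoid(r_T)."""
    gains = [sigmoid(t) for t in teacher_scores]
    order = sorted(range(len(gains)), key=lambda i: -student_scores[i])
    ideal = sorted(gains, reverse=True)
    return dcg([gains[i] for i in order]) / dcg(ideal)


# Toy scores (made up): one student agrees with the teacher, one reverses it.
teacher = [2.0, 0.5, -1.0]
aligned_student = [2.0, 0.5, -1.0]
reversed_student = [-1.0, 0.5, 2.0]

for name, student in [("aligned", aligned_student), ("reversed", reversed_student)]:
    print(name, ce_loss(teacher, student), ndcg(teacher, student))
```

On these toy inputs the student that reproduces the teacher's ordering has both the lower CE loss and the higher NDCG, which is the direction the theorem's bound suggests; the sketch does not verify the bound itself.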

Figures (10)

  • Figure 1: Performance comparison of different KD methods. We report the results in three homogeneous Teacher $\to$ Student settings.
  • Figure 2: Relationship between rankings given by the teacher (shown in $x$-axis) and the student (shown in $y$-axis). Items are sorted in decreasing order according to the teacher's rankings.
  • Figure 3: Ablation study on Gowalla and Yelp, including the results in three homogeneous Teacher $\to$ Student settings.
  • Figure 4: Relationship between rankings given by the teacher (shown in $x$-axis) and the student (shown in $y$-axis) on all datasets. Items are sorted in decreasing order according to the teacher's rankings.
  • Figure 5: Training curves of NDCG@10 on the training set for RCE-KD and Vanilla CE across three datasets (CiteULike, Gowalla, Yelp) and five teacher-student architecture combinations.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Theorem 4.1
  • Definition 4.2: Partial $\text{NDCG}$
  • Theorem 4.4
  • proof
  • proof