Exploring Feature-based Knowledge Distillation for Recommender System: A Frequency Perspective

Zhangchi Zhu, Wei Zhang

TL;DR

This work analyzes feature-based knowledge distillation (FD) for recommender systems through a frequency lens. It defines the $k$-th type of knowledge as the $k$-th frequency component of the features and shows that standard FD weights the losses on all frequency components equally, which can leave critical low-frequency knowledge under-emphasized. It introduces a reweighting scheme and a lightweight method, FreqD, that uses graph filtering with a polynomial filter $h(\lambda)$ to emphasize important knowledge without computing a separate loss for each component. Experiments on three public datasets across multiple backbones show that FreqD consistently outperforms existing KD methods and can approach teacher performance while offering substantial inference and training efficiency gains. The approach provides both theoretical insight and a practical tool for knowledge transfer in large-scale recommender systems.

Abstract

In this paper, we analyze the feature-based knowledge distillation for recommendation from the frequency perspective. By defining knowledge as different frequency components of the features, we theoretically demonstrate that regular feature-based knowledge distillation is equivalent to equally minimizing losses on all knowledge and further analyze how this equal loss weight allocation method leads to important knowledge being overlooked. In light of this, we propose to emphasize important knowledge by redistributing knowledge weights. Furthermore, we propose FreqD, a lightweight knowledge reweighting method, to avoid the computational cost of calculating losses on each knowledge. Extensive experiments demonstrate that FreqD consistently and significantly outperforms state-of-the-art knowledge distillation methods for recommender systems. Our code is available at https://github.com/woriazzc/KDs.
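
The equal-weighting claim and the role of a polynomial filter can be checked numerically. The sketch below is not the authors' FreqD implementation. It assumes a squared Frobenius distillation loss, a small random symmetric graph whose normalized Laplacian supplies the frequency basis, and an arbitrary illustrative filter $h(\lambda) = 1 - 0.5\lambda$. All names (`poly_filter_apply`, `coeffs`, `per_freq`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric graph on n nodes (a hypothetical stand-in for an interaction graph).
n, d = 8, 4
A = np.triu(rng.random((n, n)), 1)
A = A + A.T                                   # symmetric adjacency, zero diagonal
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian

# Graph Fourier basis: eigenvalues are the "frequencies", columns of U the modes.
lam, U = np.linalg.eigh(L)

S = rng.standard_normal((n, d))               # student features (random, illustrative)
T = rng.standard_normal((n, d))               # teacher features (random, illustrative)
diff = S - T

# Loss carried by each frequency component: ||u_k^T (S - T)||^2.
per_freq = np.sum((U.T @ diff) ** 2, axis=1)

# Assumed plain feature-distillation loss: squared Frobenius norm.
fd_loss = np.sum(diff ** 2)

# Illustrative polynomial filter h(x) = sum_i coeffs[i] * x**i (assumption).
coeffs = [1.0, -0.5]

def h(x, coeffs):
    """Evaluate the polynomial filter at scalar or array x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Reweighted loss computed explicitly in the spectral domain.
weighted_spectral = np.sum(h(lam, coeffs) * per_freq)

def poly_filter_apply(L, X, coeffs):
    """Return h(L) @ X using only repeated products with L."""
    out = np.zeros_like(X)
    P = X.copy()                              # holds L^i @ X
    for c in coeffs:
        out += c * P
        P = L @ P
    return out

# Same reweighted loss, tr((S-T)^T h(L) (S-T)), without any eigendecomposition.
weighted_filter = np.sum(diff * poly_filter_apply(L, diff, coeffs))
```

Here `per_freq.sum()` matches `fd_loss`, so the plain loss gives every frequency component unit weight, while `weighted_spectral` and `weighted_filter` agree even though the latter needs only matrix products with `L`.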

Paper Structure

This paper contains 30 sections, 3 theorems, 19 equations, 4 figures, 8 tables.

Key Result

Theorem 1

Consider the feature-based distillation loss $\mathcal{L}_{FD}$. Then $\mathcal{L}_{FD}$ can be decomposed into a sum of losses, one on each type of knowledge (frequency component).
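
One way to read this statement is sketched below, under assumptions that are not taken from the paper: the loss is a squared Frobenius distance between student features $\mathbf{S}$ and teacher features $\mathbf{T}$ (both $N \times d$), the frequency basis is an orthonormal eigenvector matrix $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_N]$ of a graph Laplacian, and the $k$-th knowledge of a feature matrix $\mathbf{X}$ is its projection $\mathbf{X}_k = \mathbf{u}_k \mathbf{u}_k^\top \mathbf{X}$. The paper's exact formulation may differ.

```latex
\begin{aligned}
\mathcal{L}_{FD}
  &= \lVert \mathbf{S} - \mathbf{T} \rVert_F^2
   = \operatorname{tr}\!\big( (\mathbf{S} - \mathbf{T})^\top \mathbf{U}\mathbf{U}^\top (\mathbf{S} - \mathbf{T}) \big)
   && \text{since } \mathbf{U}\mathbf{U}^\top = \mathbf{I} \\
  &= \sum_{k=1}^{N} \lVert \mathbf{u}_k^\top (\mathbf{S} - \mathbf{T}) \rVert_2^2
   = \sum_{k=1}^{N} \lVert \mathbf{S}_k - \mathbf{T}_k \rVert_F^2 .
\end{aligned}
```

Under these assumptions every term in the sum carries the same unit weight, which is the sense in which the plain loss treats all knowledge equally.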

Figures (4)

  • Figure 1: Feature-based distillation loss on different knowledge groups. Colors indicate Recall@20 of these students. The two backbones used in the experiments are BPRMF (left) and LightGCN (right), respectively.
  • Figure 2: Feature-based distillation loss on different knowledge groups for different weight allocation schemes.
  • Figure 3: Effects of $\alpha$.
  • Figure 4: Effects of $\beta$.

Theorems & Definitions (4)

  • Definition 4.1: Knowledge in features
  • Theorem 1
  • Theorem 2
  • Theorem 3