Curriculum-scheduled Knowledge Distillation from Multiple Pre-trained Teachers for Multi-domain Sequential Recommendation

Wenqi Sun; Ruobing Xie; Junjie Zhang; Wayne Xin Zhao; Leyu Lin; Ji-Rong Wen

TL;DR

CKD-MDSR addresses the challenge of using heterogeneous pre-trained recommendation models (PRMs) in multi-domain sequential recommendation by distilling knowledge from multiple PRMs into a lightweight student. It introduces curriculum-scheduled sampling to learn progressively from sequences of increasing difficulty, in-batch negative sampling to distill cross-teacher signals, and a consistency-aware integration mechanism to weigh and fuse knowledge from UniSRec, Recformer, and UniM$^2$Rec. The approach improves performance across five real-world datasets, keeps inference cost comparable to standard sequential recommendation (SR) models, and demonstrates its universality across diverse student architectures (FM, DeepFM, LightGCN). The work shows the practical potential of using PRMs as a knowledge source without online overhead, enabling robust cross-domain recommendation in production systems.

Abstract

Pre-trained recommendation models (PRMs) have received increasing interest recently. However, their intrinsically heterogeneous model structures, huge model sizes, and computation costs hinder their adoption in practical recommender systems. Hence, it is essential to explore how to use different pre-trained recommendation models efficiently in real-world systems. In this paper, we propose a novel curriculum-scheduled knowledge distillation from multiple pre-trained teachers for multi-domain sequential recommendation, called CKD-MDSR, which takes full advantage of different PRMs as multiple teacher models to boost a small student recommendation model, integrating the knowledge across multiple domains from PRMs. Specifically, CKD-MDSR first adopts curriculum-scheduled user behavior sequence sampling and distills informative knowledge jointly from representative PRMs such as UniSRec and Recformer. Then, the knowledge from these PRMs is selectively integrated into the student model according to their confidence and consistency. Finally, we verify the proposed method on multi-domain sequential recommendation and further demonstrate its universality with multiple types of student models, including feature-interaction and graph-based recommendation models. Extensive experiments on five real-world datasets demonstrate the effectiveness and efficiency of CKD-MDSR, which can be viewed as an efficient shortcut for using PRMs in real-world systems.
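To make the three steps in the abstract concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give the difficulty measure, the pacing schedule, or the weighting formula, so everything here is an assumption chosen for illustration: the linear pacing in `curriculum_subset`, the `confidence * consistency` weighting in `fuse_teachers`, the overlap-based consistency score, and all function and parameter names are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def curriculum_subset(difficulties, step, total_steps, start_frac=0.3):
    """Indices of the easiest sequences visible at this training step.

    Assumed linear pacing: the visible fraction grows from start_frac to 1.
    """
    frac = min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)
    k = max(1, int(round(frac * len(difficulties))))
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    return order[:k]

def fuse_teachers(teacher_scores, target_idx, temperature=1.0):
    """Fuse per-teacher scores over in-batch candidates into one soft label.

    Assumed weighting (not from the paper):
      confidence  = probability a teacher assigns to the true next item
      consistency = mean overlap (sum of element-wise minima) with the
                    distributions of the other teachers
    """
    probs = [softmax(s, temperature) for s in teacher_scores]
    weights = []
    for i, p in enumerate(probs):
        confidence = p[target_idx]
        others = [q for j, q in enumerate(probs) if j != i]
        if others:
            overlaps = [sum(min(a, b) for a, b in zip(p, q)) for q in others]
            consistency = sum(overlaps) / len(others)
        else:
            consistency = 1.0
        weights.append(confidence * consistency)
    norm = sum(weights)
    weights = [w / norm for w in weights]
    n_candidates = len(probs[0])
    fused = [sum(w * p[c] for w, p in zip(weights, probs))
             for c in range(n_candidates)]
    return fused, weights

def distill_loss(student_scores, soft_label, temperature=1.0):
    """Cross-entropy between the fused teacher distribution and the student."""
    student = softmax(student_scores, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_label, student))

# Toy usage: three hypothetical teachers score three in-batch candidates;
# candidate 0 is the ground-truth next item.
teacher_scores = [[2.0, 0.0, 0.0], [1.5, 0.5, 0.0], [0.0, 2.0, 0.0]]
fused, weights = fuse_teachers(teacher_scores, target_idx=0)
loss = distill_loss([0.0, 0.0, 0.0], fused)
```

In this toy batch the third teacher disagrees with the other two and ranks the true item low, so under the assumed weighting it contributes least to the fused soft label. Only the small student would run at inference time, which is the source of the efficiency claim in the abstract.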

Paper Structure (36 sections, 10 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Pre-trained Recommendation Models
Multi-domain Sequential Recommendation
Knowledge Distillation for Recommendation
Methodology
Motivation
Problem formulation
Approach Overview
Curriculum-scheduled User Behavior Sequence Sampling
Difficulty Measurer
Curriculum Scheduler
Knowledge Distillation from Multiple PRMs
Pre-trained Recommendation Models in CKD-MDSR
In-batch Negative Sampling
...and 21 more sections

Figures (5)

Figure 1: Model comparisons w.r.t. the trade-off among inference time cost ($x$ axis), performance ($y$ axis, the geometric center of these circles), and memory cost (size of circles) on two datasets. Our proposed CKD-MDSR achieves good accuracy with the lowest online computation and memory costs.
Figure 2: An overview of our proposed CKD-MDSR.
Figure 3: Ablation study of CKD-MDSR on "Instruments" and "Scientific".
Figure 4: Universality analysis of CKD-MDSR on "Instruments" and "Arts".
Figure 5: Parameter analyses of CKD-MDSR on "Instruments", "Scientific" and "Arts".
