Continual Collaborative Distillation for Recommender System

Gyuseok Lee, SeongKu Kang, Wonbin Kweon, Hwanjo Yu

TL;DR

CCD addresses the problem of maintaining high-quality recommendations under non-stationary data streams. It introduces a continual, collaborative knowledge distillation (KD) framework in which a large teacher and a compact student evolve together across data blocks. Each teacher cycle runs three stages: (1) distill a compact student from the teacher via KD; (2) continually update the student on new interactions, using embedding initialization and proxy-guided replay to mitigate forgetting; and (3) update the teacher using both standard losses and student-informed signals, with an annealed cross-knowledge term. The key components are the stability and plasticity proxies, proxy-guided replay learning, and a teacher-update objective that leverages student-side knowledge; together they improve plasticity and stability (LA/RA) and give a favorable accuracy-efficiency trade-off. On Gowalla and Yelp, CCD consistently outperforms state-of-the-art baselines. The collaborative evolution of teacher and student yields cumulative gains over blocks, with practical deployment benefits from reduced training cost and latency.
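The three-stage cycle described above can be sketched as a simple schedule. This is an illustrative outline only, not the authors' implementation: the function names, the stage labels, the mapping of one teacher cycle to one data block, the number of student updates per cycle, and the linear form of the annealing are all assumptions made for the example. The summary states only that the cross-knowledge term is annealed, not how.

```python
from typing import List, Tuple


def ccd_schedule(num_blocks: int, student_updates_per_cycle: int) -> List[Tuple[int, str]]:
    """Return the ordered (block index, stage) events of a CCD-style loop.

    Hypothetical helper: assumes one teacher cycle per data block and a fixed
    number of student updates inside each cycle.
    """
    events: List[Tuple[int, str]] = []
    for k in range(num_blocks):
        # Stage 1: distill a compact student from the current teacher.
        events.append((k, "distill_student"))
        # Stage 2: update the student on newly arrived interactions
        # (embedding initialization and proxy-guided replay would happen here).
        for _ in range(student_updates_per_cycle):
            events.append((k, "update_student"))
        # Stage 3: update the teacher with standard losses plus
        # student-informed signals.
        events.append((k, "update_teacher"))
    return events


def annealed_weight(epoch: int, num_epochs: int, initial: float = 1.0) -> float:
    """Linearly decay a loss weight from `initial` to 0 (assumed schedule)."""
    return initial * max(0.0, 1.0 - epoch / num_epochs)


if __name__ == "__main__":
    for block, stage in ccd_schedule(num_blocks=2, student_updates_per_cycle=3):
        print(block, stage)
    print([round(annealed_weight(e, 4), 2) for e in range(5)])
```

In this sketch the student is updated several times between teacher updates, mirroring the shorter student cycle $C^S$ relative to the teacher cycle $C^T$.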

Abstract

Knowledge distillation (KD) has emerged as a promising technique for addressing the computational challenges of deploying large-scale recommender systems. KD transfers the knowledge of a massive teacher system to a compact student model, reducing the heavy computational burden of inference while retaining high accuracy. Existing KD studies primarily focus on one-time distillation in static environments, leaving a substantial gap in their applicability to real-world scenarios with continuously incoming users, items, and interactions. In this work, we develop a systematic approach to operating teacher-student KD on a non-stationary data stream. Our goal is to enable efficient deployment through a compact student that preserves the high performance of the massive teacher while effectively adapting to continuously incoming data. We propose the Continual Collaborative Distillation (CCD) framework, in which both the teacher and the student continually and collaboratively evolve along the data stream. CCD helps the student adapt effectively to new data, while also enabling the teacher to fully leverage accumulated knowledge. We validate the effectiveness of CCD through extensive quantitative, ablative, and exploratory experiments on two real-world datasets. We expect this research direction to help narrow the gap between existing KD studies and practical applications, thereby enhancing the applicability of KD in real-world systems.

Paper Structure

This paper contains 35 sections, 10 equations, 5 figures, 10 tables, and 1 algorithm.

Figures (5)

  • Figure 1: A conceptual comparison of (a) knowledge distillation, (b) continual learning, and (c) the proposed continual collaborative distillation. $C^T$ and $C^S$ denote the update cycle for the massive system (e.g., a weekly update) and the compact model (e.g., a daily update), respectively.
  • Figure 2: Overview of the CCD framework for the $k$-th data block.
  • Figure 3: Effects of the proxies. After training on $D_4$ and $D_5$ (y-axis), we assess Recall@20 on test sets from the previous blocks (x-axis). A naive update of the student results in significant catastrophic forgetting. The two proxies effectively accumulate previous knowledge from complementary perspectives. (Dataset: Yelp)
  • Figure 4: H-mean gain of CCD over the best CL competitor.
  • Figure 5: Performance with varying replay sizes.