Lightweight Modality Adaptation to Sequential Recommendation via Correlation Supervision

Hengchang Hu, Qijiong Liu, Chuang Li, Min-Yen Kan

TL;DR

The paper addresses modality forgetting in sequential recommender systems by proposing a lightweight two-stage approach, KDSR, which preserves modality information through correlation distillation learning (CDL). CDL introduces two supervision signals—holistic correlations and dissected correlation codes—derived from modality encoders via an autoencoder and vector quantization, and it employs soft matching to align teacher and student embeddings. An asynchronous training strategy helps maintain original modality details during representation learning, and the method is compatible with various backbones and encoders. Empirical results across multiple datasets show that KDSR outperforms top baselines by an average of 6.8% in HR, with additional insights that larger modality encoders benefit from more fine-grained correlation modeling, validating the approach's practicality and scalability.
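As a rough illustration of what a "dissected correlation code" could look like, the sketch below quantizes a pairwise item feature to the index of its nearest codebook entry, which is the core step of vector quantization. Everything here is an assumption for illustration: the element-wise product used as the pair feature, the hand-written codebook, and the helper names `pair_feature` and `nearest_code` are hypothetical and not taken from the paper, which learns its representation with an autoencoder.

```python
def pair_feature(u, v):
    """Hypothetical pair feature: element-wise product of two item vectors."""
    return [a * b for a, b in zip(u, v)]


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance (the vector quantization step)."""
    def sq_dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


# Toy codebook with three entries; a real one would be learned, not hand-written.
codebook = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Two item pairs whose frozen modality vectors agree on different dimensions.
code_ab = nearest_code(pair_feature([1.0, 0.1], [0.9, 0.0]), codebook)
code_cd = nearest_code(pair_feature([0.1, 1.0], [0.0, 0.9]), codebook)
```

In this toy setup the two pairs receive different codes because their vectors agree on different dimensions, which is the intuition behind separating relationship facets into discrete types.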

Abstract

In Sequential Recommenders (SR), encoding and utilizing modalities in an end-to-end manner is costly in terms of modality encoder sizes. Two-stage approaches can mitigate such concerns, but they suffer from poor performance due to modality forgetting, where the sequential objective overshadows modality representation. We propose a lightweight knowledge distillation solution that preserves both merits: retaining modality information and maintaining high efficiency. Specifically, we introduce a novel method that enhances the learning of embeddings in SR through the supervision of modality correlations. The supervision signals are distilled from the original modality representations, including both (1) holistic correlations, which quantify their overall associations, and (2) dissected correlation types, which refine their relationship facets (homing in on specific aspects like color or shape consistency). To further address the issue of modality forgetting, we propose an asynchronous learning step, allowing the original information to be retained longer for training the representation learning module. Our approach is compatible with various backbone architectures and outperforms the top baselines by 6.8% on average. We empirically demonstrate that preserving original feature associations from modality encoders significantly boosts task-specific recommendation adaptation. Additionally, we find that larger modality encoders (e.g., Large Language Models) contain richer feature sets that necessitate more fine-grained modeling to reach their full performance potential.
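The idea of supervising embeddings with holistic correlations can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: cosine similarity as the correlation measure, mean squared error as the matching loss, and the function names are all hypothetical choices. The teacher vectors stand in for frozen modality-encoder outputs and the student vectors for the trainable item embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correlation_matrix(vectors):
    """Pairwise cosine similarities between all items."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def holistic_correlation_loss(teacher_vecs, student_vecs):
    """Mean squared difference between teacher and student correlation matrices.
    Only item-to-item correlations are compared, so the teacher and student
    vectors may have different dimensionalities."""
    t = correlation_matrix(teacher_vecs)
    s = correlation_matrix(student_vecs)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]   # teacher rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # all items made identical

loss_preserved = holistic_correlation_loss(teacher, rotated)
loss_forgotten = holistic_correlation_loss(teacher, collapsed)
```

The rotated student keeps every pairwise correlation, so its loss is essentially zero even though no coordinate matches the teacher. The collapsed student loses the item relationships and is penalized, which mirrors the modality forgetting the abstract describes.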

Paper Structure (12 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: (left) Diagram of adaptation from original modality representation to modality embedding over training epochs. (right) Experiment on the Beauty dataset with GRU-based models, highlighting the issue of modality forgetting.
  • Figure 2: (left) Our KDSR framework. Green, orange, and blue denote image modality, text modality, and ID features, respectively. (right) Detail of Correlation Distillation Learning on the image (green) modality.
  • Figure 3: Compatibility study with different representation learning backbones.
  • Figure 4: Evaluation on two datasets using the MMSR backbone with image and text modality encoders of various parameter sizes. R* denotes ResNet (He et al., 2016) at different sizes. Swin-T/B are transformer-based models (Liu et al., 2021) in two sizes. T5-b/l denote the base and large versions of T5 (Raffel et al., 2020). ChatGLM (Zeng et al., 2023) and LLaMA-13B (Touvron et al., 2023) are two recently introduced LLMs.