PromptMM: Multi-Modal Knowledge Distillation for Recommendation with Prompt-Tuning

Wei Wei, Jiabin Tang, Yangqin Jiang, Lianghao Xia, Chao Huang

TL;DR

PromptMM tackles overfitting and noise in multi-modal recommenders by distilling knowledge from a heavy modality-aware teacher into a lightweight student. It introduces soft prompt-tuning as a semantic bridge and proposes disentangled modality-aware ranking KD alongside embedding KD, enabling task-adaptive, robust knowledge transfer. The approach yields a compact, efficient model with superior accuracy across the Netflix, TikTok, and Electronics datasets, supported by extensive ablations and resource analyses. The framework offers practical gains in online inference speed and memory usage while maintaining or improving recommendation quality, making it suitable for real-world multi-modal systems.

Abstract

Multimedia online platforms (e.g., Amazon, TikTok) have greatly benefited from incorporating multimedia content (e.g., visual, textual, and acoustic) into their personalized recommender systems. These modalities provide intuitive semantics that facilitate modality-aware user preference modeling. However, two key challenges in multi-modal recommenders remain unresolved: i) Multi-modal encoders introduce a large number of additional parameters, which causes overfitting given the high-dimensional multi-modal features provided by extractors (e.g., ViT, BERT). ii) Side information inevitably introduces inaccuracies and redundancies, which skew the modality-interaction dependency away from reflecting true user preference. To tackle these problems, we propose to simplify and empower recommenders through Multi-modal Knowledge Distillation with prompt-tuning (PromptMM), which enables adaptive quality distillation. Specifically, PromptMM performs model compression by distilling user-item (u-i) edge relationships and multi-modal node content from cumbersome teachers, relieving students of the additional feature-reduction parameters. To bridge the semantic gap between multi-modal context and collaborative signals and to empower the overfitting teacher, soft prompt-tuning is introduced to make the distillation adaptive to the student's task. Additionally, to adjust for the impact of inaccuracies in multimedia data, a disentangled multi-modal list-wise distillation is developed with a modality-aware re-weighting mechanism. Experiments on real-world data demonstrate PromptMM's superiority over existing techniques. Ablation tests confirm the effectiveness of key components. Additional tests demonstrate its efficiency and effectiveness.
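
To make the idea of list-wise (ranking) distillation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher and the student each score the same candidate list for one user, converts both score lists into distributions with a temperature softmax, and measures the KL divergence from teacher to student. The function names, the `temperature` parameter, and the example scores are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over one user's candidate-item scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kd_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the same candidate list (assumed loss form)."""
    p_teacher = softmax(teacher_scores, temperature)
    p_student = softmax(student_scores, temperature)
    return sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0  # terms with zero teacher mass contribute nothing
    )


# Hypothetical scores for one user: index 0 is an observed item,
# the remaining entries are sampled negatives.
teacher_scores = [2.0, 0.5, -1.0, 0.1]
student_scores = [1.0, 0.8, -0.5, 0.0]

loss = listwise_kd_loss(teacher_scores, student_scores, temperature=0.5)
same = listwise_kd_loss(teacher_scores, teacher_scores, temperature=0.5)
print(f"KD loss (student vs. teacher): {loss:.4f}")
print(f"KD loss (teacher vs. itself):  {same:.4f}")
```

The loss is zero when the student reproduces the teacher's ranking distribution and grows as the two diverge. In this sketch only the student would be used at inference time, which is where the savings in online cost would come from.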

Paper Structure

This paper contains 33 sections, 15 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: PromptMM learns a lightweight recommender with minimal online resource consumption using three types of KD: i) ranking KD; ii) denoised modality-aware ranking KD; iii) modality-aware embedding KD. In addition, prompt-tuning enables adaptive, task-relevant KD, while disentangling and re-weighting adjust for the impact of noise in the modalities.
  • Figure 2: Calculation example of disentangled KD.
  • Figure 3: t-SNE visualization on TikTok of the raw high-dimensional multi-modal features $\mathbf{X}^m$, the modality-specific representations $\mathbf{F}^m$ of PromptMM, and $\mathbf{F}^m$ of the variant w/o-Prompt.
  • Figure 4: Impact study of hyperparameters in PromptMM.
  • Figure 5: Training curves of the PromptMM framework in terms of Recall@20, NDCG@20, and $\mathcal{L}$ on the TikTok dataset.
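
The captions for Figures 1 and 2 refer to disentangling and re-weighting a KD loss. As a hedged illustration only, the sketch below shows one well-known way to disentangle a KL-based distillation loss: split it into a binary "target vs. rest" term and a term over the non-target items. With the weight `1 - p_target` on the second term, the split reproduces the ordinary KL divergence exactly. Replacing that weight with a free value is one possible form of re-weighting. This may differ from the paper's exact formulation. The names `disentangled_kd`, `w_target`, and `w_rest`, and the example scores, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def disentangled_kd(teacher_scores, student_scores, target=0,
                    w_target=1.0, w_rest=None):
    """Split KL(teacher || student) into a target term and a non-target term."""
    p_t = softmax(teacher_scores)
    p_s = softmax(student_scores)

    # Binary distributions: probability of the target item vs. everything else.
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]

    # Distributions over the non-target items, renormalised to sum to one.
    rest_t = [p / b_t[1] for i, p in enumerate(p_t) if i != target]
    rest_s = [p / b_s[1] for i, p in enumerate(p_s) if i != target]

    if w_rest is None:
        # This weight makes the decomposition equal to the plain KL divergence.
        w_rest = b_t[1]
    return w_target * kl(b_t, b_s) + w_rest * kl(rest_t, rest_s)


# Hypothetical scores: index 0 is the observed (target) item.
teacher_scores = [2.0, 0.5, -1.0, 0.1]
student_scores = [1.0, 0.8, -0.5, 0.0]

full_kl = kl(softmax(teacher_scores), softmax(student_scores))
exact = disentangled_kd(teacher_scores, student_scores)
reweighted = disentangled_kd(teacher_scores, student_scores, w_rest=1.0)
print(f"plain KL:          {full_kl:.6f}")
print(f"disentangled sum:  {exact:.6f}")
print(f"re-weighted (1.0): {reweighted:.6f}")
```

When the teacher is very confident about the target item, the default weight `1 - p_target` becomes small and the information carried by the non-target ranking is suppressed. Decoupling the two terms lets that weight be chosen independently, which is the design motivation usually given for this kind of split.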