LD4MRec: Simplifying and Powering Diffusion Model for Multimedia Recommendation

Jiarui Zhu; Jun Hou; Penghang Yu; Zhiyi Tan; Bing-Kun Bao

TL;DR

This work addresses the challenge of noise in observed user behaviors for multimedia recommendation by proposing LD4MRec, a Light Diffusion model that enables real-time, forward-free inference. A Conditional neural Network (C-Net) guides generation using two signals: collaborative signals and personalized modality preference signals, with semi-supervised soft reconstruction to distill stable user preferences. The model is validated on three real-world datasets, demonstrating superior predictive performance and significant inference-time reductions compared with prior diffusion-based approaches. The approach offers practical improvements in robustness to noisy data and efficiency for deployment in real-time recommender systems.

Abstract

Multimedia recommendation aims to predict users' future behaviors based on observed behaviors and item content information. However, the noise inherent in observed behaviors easily leads to suboptimal recommendation performance. The diffusion model's ability to generate information from noise offers a promising solution to this issue, prompting us to explore its application in multimedia recommendation. Two challenges must be addressed: 1) the diffusion model must be simplified to meet the efficiency requirements of real-time recommender systems, and 2) the generated behaviors must align with user preferences. To address these challenges, we propose a Light Diffusion model for Multimedia Recommendation (LD4MRec). LD4MRec substantially reduces computational complexity by employing a forward-free inference strategy, which directly predicts future behaviors from observed noisy behaviors. To ensure that generated behaviors align with user preferences, we propose a novel Conditional neural Network (C-Net). C-Net guides generation with two key signals, collaborative signals and personalized modality preference signals, thereby improving the semantic consistency between generated behaviors and user preferences. Experiments conducted on three real-world datasets demonstrate the effectiveness of LD4MRec.
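
As a rough illustration of the forward-free inference idea described in the abstract, the toy Python sketch below contrasts a classic-style loop (corrupt the input, then denoise over several steps) with a single denoiser call applied directly to the observed behaviors. It is not the authors' implementation: `CountingDenoiser`, `multi_step_inference`, `forward_free_inference`, the noise scale, and the step index passed to the denoiser are all hypothetical, and the denoiser merely clips scores where a trained network such as C-Net would be used.

```python
import random


class CountingDenoiser:
    """Hypothetical stand-in for a learned denoiser such as C-Net.

    It only clips scores into [0, 1] and counts how often it is called;
    a real model would be a trained, conditioned neural network.
    """

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return [min(1.0, max(0.0, v)) for v in x]


def multi_step_inference(observed, denoiser, steps, noise_scale=0.1, seed=0):
    """Classic-style inference: corrupt the input, then denoise step by step."""
    rng = random.Random(seed)
    # Forward process (assumed Gaussian corruption of the interaction vector).
    x = [v + rng.gauss(0.0, noise_scale) for v in observed]
    # Reverse process: one denoiser call per step.
    for t in range(steps, 0, -1):
        x = denoiser(x, t)
    return x


def forward_free_inference(observed, denoiser):
    """Forward-free inference: treat the observed behaviors as already noisy
    and predict the clean behaviors with a single denoiser call.
    The step index 0 is an arbitrary choice for this sketch."""
    return denoiser(observed, 0)


observed = [1.0, 0.0, 1.0, 0.0]  # toy user-item interaction vector

multi = CountingDenoiser()
multi_step_inference(observed, multi, steps=5)

single = CountingDenoiser()
scores = forward_free_inference(observed, single)

print("denoiser calls (multi-step):", multi.calls)
print("denoiser calls (forward-free):", single.calls)
```

The call counts make the efficiency argument concrete: the multi-step variant pays for one network evaluation per reverse step, whereas the forward-free variant pays for exactly one.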

Paper Structure (23 sections, 22 equations, 6 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Multimedia Recommendation
Diffusion models
Preliminary
Light Diffusion Model
Forward and Reverse Processes
Efficient Training and Inference
Discussion
C-Net
Dual Condition Signals
Forward Diffusion Step Embeddings
FC Layer
Collaboration-aware Generation Block (CG Block)
Preference-aware Generation Block (PG Block)
...and 8 more sections

Figures (6)

Figure 1: (a) Classic diffusion models generate data from Gaussian noise. (b) The proposed light diffusion model generates behaviors in a single step from observed noisy behaviors.
Figure 2: (a) The overall framework of LD4MRec. (b) CG-Block denoises the representations with the guidance of collaborative signals. (c) PG-Block generates behavior information under the control of user multimodal preferences.
Figure 3: Performance comparison between different forward steps.
Figure 4: Performance comparison between different noise levels.
Figure 5: Performance comparison between different variants.
...and 1 more figures
