DiffMM: Multi-Modal Diffusion Model for Recommendation

Yangqin Jiang, Lianghao Xia, Wei Wei, Da Luo, Kangyi Lin, Chao Huang

TL;DR

DiffMM tackles data sparsity in multi-modal recommendation by combining a modality-aware graph diffusion model with cross-modal contrastive learning, so that multi-modal context is aligned with collaborative signals. It fuses a forward-backward diffusion process, modality-aware signal injection (MSI), and multi-modal graph aggregation to produce robust modality-specific user/item representations. Experiments on TikTok and Amazon datasets show consistent gains over diverse baselines, and ablation studies highlight the importance of the diffusion model, contrastive augmentation, and MSI. The framework offers an approach to leveraging multi-modal data in recommender systems and points to language-model-guided diffusion for richer augmentations as a future direction.

Abstract

The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, data sparsity remains a key challenge in these systems. To address it, recent research has introduced self-supervised learning techniques to enhance recommender systems. These methods, however, often rely on simplistic random augmentation or intuitive cross-view information, which can introduce irrelevant noise and fail to accurately align the multi-modal context with user-item interaction modeling. To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning. This integration facilitates better alignment between multi-modal feature information and collaborative relation modeling. Our approach leverages the generative capabilities of diffusion models to automatically generate a modality-aware user-item graph, which helps incorporate useful multi-modal knowledge when modeling user-item interactions. We conduct extensive experiments on three public datasets, which consistently demonstrate the superiority of DiffMM over various competitive baselines. The source code of the proposed framework is available at: https://github.com/HKUDS/DiffMM.
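
The two ingredients named above, a diffusion process over user-item interactions and a cross-modal contrastive objective, can be illustrated with a minimal sketch. It is not taken from the DiffMM repository. It assumes a standard DDPM-style Gaussian forward process with a linear noise schedule and an InfoNCE loss, neither of which is specified in this summary, and every function name, parameter value, and toy vector below is hypothetical.

```python
import math
import random


def alpha_bar(t, num_steps, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under an assumed linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_min + (beta_max - beta_min) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(x0, t, num_steps, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar) * x_0, (1 - a_bar) * I)."""
    a_bar = alpha_bar(t, num_steps)
    scale, noise_scale = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE loss: row i of view_b is the positive for row i of view_a."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    loss = 0.0
    for i, u in enumerate(view_a):
        logits = [cos(u, v) / temperature for v in view_b]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(view_a)


rng = random.Random(0)

# Toy data: one user's row of the interaction matrix (1.0 = interacted).
interactions = [1.0, 0.0, 0.0, 1.0, 0.0]
noisy = forward_diffuse(interactions, t=5, num_steps=10, rng=rng)

# Toy modality-specific embeddings for two users (hypothetical values).
visual_view = [[1.0, 0.0], [0.0, 1.0]]
textual_view = [[0.9, 0.1], [0.2, 0.8]]
aligned_loss = info_nce(visual_view, textual_view)
shuffled_loss = info_nce(visual_view, textual_view[::-1])
```

In this toy setup the loss is lower when the two modality views of the same user are paired than when the pairing is shuffled, which is the alignment pressure a contrastive objective provides. A learned reverse (denoising) process, which the sketch omits, would be needed to actually generate a modality-aware graph.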

Paper Structure (34 sections, 26 equations, 5 figures, 5 tables)

This paper contains 34 sections, 26 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall framework of the proposed multi-modal diffusion model (DiffMM).
  • Figure 2: Performance w.r.t. user interaction numbers.
  • Figure 3: Hyperparameter analysis on different datasets.
  • Figure 4: Comparison between diffusion-enhanced data augmentation and random augmentation. The results show performance w.r.t. the fusion ratio used to combine the two views.
  • Figure 5: Case study on the generated modality-aware user-item graph, using visual modality from Amazon-Baby data.