MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Edith C. -H. Ngai

TL;DR

This work proposes MENTOR (Multi-level sElf-supervised learNing for mulTimOdal Recommendation), a method that addresses the label sparsity problem and the modality alignment problem by introducing two multilevel self-supervised tasks.

Abstract

With the growing volume of multimedia information, multimodal recommendation has received extensive attention. It uses multimodal information to alleviate the data sparsity problem in recommendation systems, thereby improving recommendation accuracy. However, reliance on labeled data severely limits the performance of multimodal recommendation models. Recently, self-supervised learning has been applied to multimodal recommendation to mitigate the label sparsity problem. Nevertheless, state-of-the-art methods cannot avoid modality noise when aligning multimodal information, because the distributions of different modalities differ widely. To this end, we propose a Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR) method to address the label sparsity problem and the modality alignment problem. Specifically, MENTOR first enhances the specific features of each modality using a graph convolutional network (GCN) and fuses the visual and textual modalities. It then enhances the item representation via the item semantic graph for all modalities, including the fused modality. Next, it introduces two multilevel self-supervised tasks: the multilevel cross-modal alignment task and the general feature enhancement task. The multilevel cross-modal alignment task aligns each modality at multiple levels under the guidance of the ID embedding while preserving historical interaction information. The general feature enhancement task enhances the general features from both the graph and feature perspectives to improve the robustness of the model. Extensive experiments on three publicly available datasets demonstrate the effectiveness of our method. Our code is publicly available at https://github.com/Jinfeng-Xu/MENTOR.
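
The abstract does not say how the item semantic graph is built or how it enhances item representations. As a minimal sketch only, the example below assumes a common choice in multimodal recommendation: link each item to its k most similar items by cosine similarity of one modality's features, then average neighbor embeddings once. The helper names `knn_item_graph` and `propagate`, the value of `k`, and the mean aggregation are all illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_item_graph(features, k):
    """Hypothetical helper: map each item index to its k most similar other items."""
    graph = {}
    for i, f_i in enumerate(features):
        scored = [(cosine(f_i, f_j), j) for j, f_j in enumerate(features) if j != i]
        # Highest similarity first; ties broken by the smaller item index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph

def propagate(embeddings, graph):
    """Assumed enhancement step: replace each item by the mean of its neighbors."""
    out = []
    for i, e in enumerate(embeddings):
        nbrs = graph[i]
        if not nbrs:
            out.append(list(e))  # isolated item keeps its own embedding
            continue
        out.append([sum(embeddings[j][d] for j in nbrs) / len(nbrs)
                    for d in range(len(e))])
    return out

# Toy visual features for four items: items 0/1 and items 2/3 look alike.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(knn_item_graph(visual, k=1))  # {0: [1], 1: [0], 2: [3], 3: [2]}
```

In this reading, one such graph would be built per modality (visual, textual, fused) and the propagated embeddings would feed the later self-supervised tasks. The repository linked above is the reference for what MENTOR actually does.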

Paper Structure (35 sections, 32 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The architecture of MENTOR. A graph convolutional network first extracts the specific features of each modality. The visual and textual modalities are then fused, and the item semantic graph is used to explore latent information across four modality representations: VT Fusion, ID, Visual, and Textual. An alignment self-supervised task (2) aligns the modalities without losing interaction information. Two further self-supervised tasks, feature masking (1) and graph perturbation (3), enhance the general features.
  • Figure 2: Effect of multilevel cross-modal alignment.
  • Figure 3: Distributions of the textual and visual modality representations. Panels (a) and (c) show the distributions for MENTOR$_{base}$, and panels (b) and (d) show the distributions for MENTOR. Blue represents the textual modality and green represents the visual modality.
  • Figure 4: Effect of the balancing hyper-parameter $\lambda_{align}$.
  • Figure 5: Performance of MENTOR with respect to different hyper-parameter pairs ($p$,$\lambda_f$) and ($\tau$,$\lambda_g$). Darker colors denote better recommendation performance.
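
The captions name a feature masking task and the hyper-parameters $p$, $\tau$, $\lambda_f$, and $\lambda_g$ without defining them. The sketch below illustrates one plausible reading, under clearly stated assumptions: $p$ is taken to be a per-dimension masking probability, $\tau$ a contrastive temperature, and the masked and unmasked views are compared with an InfoNCE loss. The caption pairs $\tau$ with $\lambda_g$, so using a temperature in the feature-side loss here is itself an assumption. The function names are hypothetical and the code is not taken from the MENTOR repository.

```python
import math
import random

def mask_features(x, p, rng):
    """Zero each dimension independently with probability p (assumed masking scheme)."""
    return [0.0 if rng.random() < p else v for v in x]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(view_a, view_b, tau):
    """InfoNCE: row i of view_a should match row i of view_b, not any other row."""
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / n

rng = random.Random(0)
items = [[1.0, 0.2, 0.0, 0.5], [0.0, 1.0, 0.3, 0.1], [0.4, 0.0, 1.0, 0.9]]
masked = [mask_features(x, p=0.25, rng=rng) for x in items]

# Self-supervised loss between the original and masked views of the same items.
loss_f = info_nce(items, masked, tau=0.5)
```

If the self-supervised terms are combined as a weighted sum, which the captions suggest but do not state, `loss_f` would be scaled by $\lambda_f$ and the graph perturbation loss by $\lambda_g$ before being added to the recommendation loss.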