Structurally Refined Graph Transformer for Multimodal Recommendation

Ke Shi, Yan Zhang, Miao Zhang, Lifan Chen, Jiali Yi, Kui Xiao, Xiaoju Hou, Zhifei Li

TL;DR

SRGFormer addresses the challenge of learning user preferences from multimodal data by refining both global and local user-item structures through a transformer-based attention mechanism and a multimodal hypergraph. It introduces self-supervised cross-modal tasks and a Gumbel-Softmax-based hyperedge construction to align modalities and capture local dependencies, while a modified transformer extracts global patterns. Empirical results on three public datasets show consistent improvements over state-of-the-art baselines, with notable gains on the Sports and Baby datasets and clear evidence of the importance of each modality and component. This approach offers a practical, scalable pathway to more accurate and interpretable multimodal recommendations in diverse domains.
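The summary above names Gumbel-Softmax-based hyperedge construction without giving details. The following is a minimal sketch of the generic Gumbel-Softmax relaxation only, not the authors' implementation. It assumes each item has a row of affinity logits over candidate hyperedges. The function names `gumbel_softmax` and `soft_hyperedge_assignment`, the temperature `tau`, and the example logits are all hypothetical, introduced here for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw one relaxed (soft) one-hot sample over len(logits) categories."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1);
        # u is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_hyperedge_assignment(item_logits, tau=0.5, seed=0):
    """Assumed setup: one row of item-to-hyperedge logits per item."""
    rng = random.Random(seed)
    return [gumbel_softmax(row, tau, rng) for row in item_logits]


if __name__ == "__main__":
    # Two hypothetical items scored against three candidate hyperedges.
    logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]
    for row in soft_hyperedge_assignment(logits):
        print([round(p, 3) for p in row])
```

Each output row is a probability vector over hyperedges. Lowering `tau` pushes the rows toward one-hot assignments, while raising it makes them smoother, which is the usual reason this relaxation is used when a discrete choice has to remain differentiable.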

Abstract

Multimodal recommendation systems utilize various types of information, including images and text, to enhance the effectiveness of recommendations. The key challenge is predicting user purchasing behavior from the available data. Current recommendation models prioritize extracting multimodal information while neglecting the distinction between redundant and valuable data. They also rely heavily on a single semantic framework (e.g., local or global semantics), resulting in an incomplete or biased representation of user preferences, particularly those less expressed in prior interactions. Furthermore, these approaches fail to capture the complex interactions between users and items, limiting the model's ability to meet the needs of diverse users. To address these challenges, we present SRGFormer, a structurally optimized multimodal recommendation model. By modifying the transformer for better integration into our model, we capture the overall behavior patterns of users. Then, we enhance structural information by embedding multimodal information into a hypergraph structure to aid in learning the local structures between users and items. Meanwhile, applying self-supervised tasks to user-item collaborative signals enhances the integration of multimodal information, thereby revealing the representational features inherent to each data modality. Extensive experiments on three public datasets reveal that SRGFormer surpasses previous benchmark models, achieving an average performance improvement of 4.47 percent on the Sports dataset. The code is publicly available online.

Paper Structure

This paper contains 21 sections, 19 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Intuitions of existing MMRec methods and SRGFormer.
  • Figure 2: (a) The multimodal interaction and modeling module captures user and item representations alongside semantic representations of modal information. (b) The structural information interaction and modeling module enhances the user's structural comprehension. (c) The fusion and prediction module combines semantic information from various modalities, collaborative signals, and user structures to forecast user preference scores.
  • Figure 3: Visual Contrast: AnchorEdges and BasicEdges.
  • Figure 4: The performance of hyperparameter head on the Baby, Sports, and Clothing datasets in terms of Recall@10 and NDCG@10.
  • Figure 5: The performance of hyperparameter $\gamma$ on the Baby, Sports, and Clothing datasets in terms of Recall@10 and NDCG@10.
  • ...and 2 more figures