Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation

Jun Hu, Bryan Hooi, Bingsheng He, Yinwei Wei

TL;DR

This work tackles multimodal recommendation on user-item graphs, starting from the observation that different modalities propagate most effectively with different receptive fields. It introduces MIG-GT, which combines two components: Modality-Independent Receptive Fields (MIRF), which use separate GNNs with modality-specific hop counts $K^{(E)}$, $K^{(T)}$, and $K^{(V)}$, and a Sampling-based Global Transformer (SGT), which injects global context efficiently. A Transformer Unsmooth Regularization (TUR) and an L2 term support stable optimization. On three Amazon datasets, MIG-GT achieves state-of-the-art or competitive performance without relying on denoising or explicit item-item modeling, and ablations demonstrate the effectiveness of both MIRF and SGT. The code is publicly available.

Abstract

Multimodal recommendation systems can learn users' preferences from existing user-item interactions as well as the semantics of multimodal data associated with items. Many existing methods model this through a multimodal user-item graph, approaching multimodal recommendation as a graph learning task. Graph Neural Networks (GNNs) have shown promising performance in this domain. Prior research has capitalized on GNNs' capability to capture neighborhood information within certain receptive fields (typically denoted by the number of hops, $K$) to enrich user and item semantics. We observe that the optimal receptive fields for GNNs can vary across different modalities. In this paper, we propose GNNs with Modality-Independent Receptive Fields, which employ separate GNNs with independent receptive fields for different modalities to enhance performance. Our results indicate that the optimal $K$ for certain modalities on specific datasets can be as low as 1 or 2, which may restrict the GNNs' capacity to capture global information. To address this, we introduce a Sampling-based Global Transformer, which utilizes uniform global sampling to effectively integrate global information for GNNs. We conduct comprehensive experiments that demonstrate the superiority of our approach over existing methods. Our code is publicly available at https://github.com/CrawlScript/MIG-GT.
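The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a LightGCN-style propagation (symmetric-normalized adjacency, mean over hop outputs), mean fusion across modalities, and a single-head dot-product attention over uniformly sampled nodes. The function names, the toy graph, the random features, and the hop counts are all hypothetical choices made for illustration.

```python
import numpy as np

def normalized_adjacency(num_users, num_items, interactions):
    """Symmetric-normalized adjacency of the bipartite user-item graph."""
    n = num_users + num_items
    adj = np.zeros((n, n))
    for u, i in interactions:
        adj[u, num_users + i] = 1.0
        adj[num_users + i, u] = 1.0
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def propagate(adj_norm, x, k):
    """K-hop propagation; averaging hop outputs is an assumed design choice."""
    layers = [x]
    for _ in range(k):
        layers.append(adj_norm @ layers[-1])
    return np.mean(layers, axis=0)

def sampled_global_attention(h, num_samples, rng):
    """Each node attends to a uniformly sampled subset of all nodes."""
    n, d = h.shape
    idx = rng.choice(n, size=num_samples, replace=False)  # uniform global sampling
    keys = h[idx]
    scores = h @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return h + weights @ keys  # residual connection (assumed)

rng = np.random.default_rng(0)
num_users, num_items, dim = 4, 5, 8
n = num_users + num_items
interactions = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 3), (3, 4), (3, 0)]
adj_norm = normalized_adjacency(num_users, num_items, interactions)

# Toy features per modality: E = learnable embeddings, T = text, V = visual.
# Random values for every node are a simplification made for this sketch.
features = {m: rng.normal(size=(n, dim)) for m in ("E", "T", "V")}

# Modality-independent receptive fields: one hop count per modality (hypothetical values).
hops = {"E": 3, "T": 3, "V": 2}
per_modality = [propagate(adj_norm, features[m], hops[m]) for m in hops]

fused = np.mean(per_modality, axis=0)  # mean fusion is an assumption
out = sampled_global_attention(fused, num_samples=4, rng=rng)
print(out.shape)
```

Because attention is computed against `num_samples` sampled nodes rather than all `n`, the cost of the global step grows with `n * num_samples` instead of `n * n`, which is the kind of saving a sampling-based design aims for.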

Paper Structure

This paper contains 25 sections, 12 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Performance of GNNs on Amazon Baby with features of different modalities at varying receptive fields (number of hops, $K$). "Emb" stands for learnable embeddings. The optimal $K$ is modality-dependent: Emb and Text perform best at $K=3$, while Visual performs best at $K=2$.
  • Figure 2: Examples of GNNs and Transformers.
  • Figure 3: Overall Framework of Modality-Independent Graph Neural Networks with Global Transformers (MIG-GT).
  • Figure 4: Heatmaps showing the NDCG@20 scores for different combinations of ${K^{(T)}}$ and ${K^{(V)}}$.
  • Figure 5: Heatmaps showing the NDCG@20 scores for different combinations of ${K^{(E)}}$ and ${K^{(V)}}$.
  • ...and 3 more figures