A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation

Zixuan Yi, Iadh Ounis

TL;DR

This paper introduces the Unified Graph Transformer (UGT), a cohesive, end-to-end framework for multi-modal recommendation that jointly optimizes feature extraction and fusion. By coupling a multi-way transformer for aligned visual and textual feature extraction with a unified GNN that fuses these features with user-item signals via an attentive fusion mechanism, UGT addresses the isolation of extraction and modality encoding seen in prior works. The model is trained with a joint loss combining Bayesian Personalized Ranking and an image-text contrastive objective, and is validated on three Amazon datasets, where it consistently outperforms nine strong baselines with significant gains (up to about 14% on Recall@20). Visualization and ablation studies corroborate that the unified architecture yields tighter modality alignment and more effective user/item representations, highlighting the practical impact of end-to-end multi-modal fusion in recommender systems.
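The joint objective described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The weight `lambda_c`, the temperature, the symmetric InfoNCE form of the image-text contrastive term, and all function names are assumptions made for illustration; the summary only states that Bayesian Personalized Ranking is combined with an image-text contrastive objective.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)) over (user, positive, negative) triples."""
    terms = []
    for p, n in zip(pos_scores, neg_scores):
        x = p - n
        # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
        terms.append(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))
    return sum(terms) / len(terms)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid dividing by zero
    return [x / norm for x in v]


def info_nce(queries, keys, temperature):
    """Cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def image_text_contrastive(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss (assumed form); row i of each input is item i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return 0.5 * (info_nce(imgs, txts, temperature) + info_nce(txts, imgs, temperature))


def joint_loss(pos_scores, neg_scores, image_embs, text_embs, lambda_c=0.1):
    """BPR plus a weighted contrastive term; lambda_c is a hypothetical weight."""
    contrastive = image_text_contrastive(image_embs, text_embs)
    return bpr_loss(pos_scores, neg_scores) + lambda_c * contrastive
```

In this sketch, the contrastive term is small when each item's image and text embeddings point in the same direction and large when they are mismatched, so minimising it together with the ranking loss pushes the two modalities towards alignment.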

Abstract

With the rapid development of online multimedia services, especially in e-commerce platforms, there is a pressing need for personalised recommendation systems that can effectively encode the diverse multi-modal content associated with each item. However, we argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling. Such isolated processes can harm recommendation performance. First, an isolated extraction process underestimates the importance of effective feature extraction in multi-modal recommendations, potentially incorporating non-relevant information, which is harmful to item representations. Second, an isolated modality modelling process produces disjointed embeddings for item modalities due to the individual processing of each modality, which leads to a suboptimal fusion of user/item representations for effective user preference prediction. We hypothesise that the use of a unified model for addressing both aforementioned isolated processes will enable the consistent extraction and cohesive fusion of joint multi-modal features, thereby enhancing the effectiveness of multi-modal recommender systems. In this paper, we propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which first leverages a multi-way transformer to extract aligned multi-modal features from raw data for top-k recommendation. Subsequently, we build a unified graph neural network in our UGT model to jointly fuse the user/item representations with their corresponding multi-modal features. Using the graph transformer architecture of our UGT model, we show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly used multi-modal recommendation losses.
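Since the task is framed as top-k recommendation and the reported gains are measured with Recall@20, a small sketch of how such a metric is commonly computed may help. The function names and data layout below are hypothetical, and the paper's exact evaluation protocol (for example, how test items are split or how candidates are filtered) is not specified here.

```python
def recall_at_k(ranked_items, relevant_items, k=20):
    """Fraction of a user's relevant items that appear in the top-k of the ranking.

    Assumes ranked_items contains no duplicates.
    """
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def mean_recall_at_k(rankings, ground_truth, k=20):
    """Average Recall@k over users that have at least one relevant item."""
    users = [u for u, items in ground_truth.items() if items]
    if not users:
        return 0.0
    total = sum(recall_at_k(rankings.get(u, []), ground_truth[u], k) for u in users)
    return total / len(users)
```

For example, a user with two held-out items, one of which appears in the top k, contributes a recall of 0.5 to the average.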

Paper Structure (22 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Recommendation workflow in existing multi-modal recommendation models.
  • Figure 2: Our Unified Multi-modal Graph Transformer (UGT).
  • Figure 3: Performance of our UGT model with respect to different $\lambda_c$ values on the Sports and Clothing datasets.
  • Figure 4: Performance of our UGT model using different $\epsilon$ values on the Sports and Clothing datasets.
  • Figure 5: The t-SNE visualisation of the item embeddings on the Sports and Clothing datasets. The star refers to the visual embeddings while the pentagon represents the text embeddings. The average Mean Squared Error (MSE) value indicates the average distance between the visual and textual embeddings.
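
The average MSE mentioned in the Figure 5 caption can be sketched as follows. This is only an illustration under assumptions: the caption does not say whether the distance is measured on the original embeddings or on their 2-D t-SNE projections, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def average_modality_mse(visual_embs, text_embs):
    """Average, over items, of the MSE between each item's visual and text embedding.

    Row i of each input is assumed to belong to the same item.
    """
    if len(visual_embs) != len(text_embs) or not visual_embs:
        raise ValueError("need the same non-zero number of visual and text embeddings")
    return sum(mse(v, t) for v, t in zip(visual_embs, text_embs)) / len(visual_embs)
```

A lower value means that each item's visual and textual embeddings lie closer together, which is how the caption interprets tighter modality alignment.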