Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation with Interpretability

Xin Zhou, Chunyan Miao

TL;DR

This paper tackles the interpretability challenge in multimodal recommendation by introducing DGVAE, a disentangled graph variational auto-encoder that operates on a frozen item-item graph and text-centered multimodal representations. It converts multimodal content into textual signals, learns two sets of disentangled latent factors across prototypes via a simplified residual GCN, and aligns these factors through mutual information maximization, so that user decisions can be interpreted in terms of words. Empirically, DGVAE achieves state-of-the-art results on three real-world Amazon datasets and includes a case study demonstrating its interpretability, while also highlighting the efficiency benefits of a frozen graph. The work advances practical multimodal recommendation by providing a scalable, interpretable framework that connects textual semantics with user-item interactions.

Abstract

Multimodal recommender systems amalgamate multimodal information (e.g., textual descriptions, images) into a collaborative filtering framework to provide more accurate recommendations. While the incorporation of multimodal information could enhance the interpretability of these systems, current multimodal models represent users and items utilizing entangled numerical vectors, rendering them arduous to interpret. To address this, we propose a Disentangled Graph Variational Auto-Encoder (DGVAE) that aims to enhance both model and recommendation interpretability. DGVAE initially projects multimodal information into textual contents, such as converting images to text, by harnessing state-of-the-art multimodal pre-training technologies. It then constructs a frozen item-item graph and encodes the contents and interactions into two sets of disentangled representations utilizing a simplified residual graph convolutional network. DGVAE further regularizes these disentangled representations through mutual information maximization, aligning the representations derived from the interactions between users and items with those learned from textual content. This alignment facilitates the interpretation of user binary interactions via text. Our empirical analysis conducted on three real-world datasets demonstrates that DGVAE significantly surpasses the performance of state-of-the-art baselines by a margin of 10.02%. We also furnish a case study from a real-world dataset to illustrate the interpretability of DGVAE. Code is available at: https://github.com/enoche/DGVAE.
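
To make the "frozen item-item graph" and "simplified residual graph convolution" ideas more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes the graph is built once by linking each item to its `k` most similar items under cosine similarity of content features, and that a "simplified" layer means weight-free neighbor averaging with a residual link back to the input features. The function names, the kNN construction, and the update rule `h <- A h + x` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_frozen_knn_graph(item_feats, k):
    """Hypothetical helper: link each item to its k most similar items.

    Returns a row-normalized adjacency as a list of {neighbor: weight} dicts.
    "Frozen" here means it is computed once and never updated afterwards.
    """
    n = len(item_feats)
    adj = []
    for i in range(n):
        sims = [(cosine(item_feats[i], item_feats[j]), j) for j in range(n) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
        top = [j for _, j in sims[:k]]
        adj.append({j: 1.0 / len(top) for j in top} if top else {})
    return adj


def residual_propagate(adj, x, num_layers):
    """Assumed update rule: h <- A h + x, starting from h = x.

    No weight matrices or nonlinearities are used; x is the residual input.
    """
    dim = len(x[0])
    h = [row[:] for row in x]
    for _ in range(num_layers):
        new_h = []
        for i, nbrs in enumerate(adj):
            agg = [0.0] * dim
            for j, w in nbrs.items():
                for d in range(dim):
                    agg[d] += w * h[j][d]
            new_h.append([agg[d] + x[i][d] for d in range(dim)])
        h = new_h
    return h


# Toy usage: four items with 2-d content features forming two similar pairs.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_frozen_knn_graph(feats, k=1)
smoothed = residual_propagate(graph, feats, num_layers=1)
```

Because the adjacency depends only on fixed content features in this sketch, it can be precomputed and reused across training epochs, which is the kind of efficiency benefit a frozen graph offers.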

Paper Structure

This paper contains 20 sections, 17 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: The framework of DGVAE, which fully utilizes multimodal information to construct the word vector and the item-item graph. DGVAE learns its model parameters by reconstructing both the word vector and the rating vector of a user. This figure is best viewed in color.
  • Figure 2: An illustration of interpretability of DGVAE. The left part shows the interacted items by user "A1E3O99XB3BN3W". The middle part visualizes the learned latent prototypes of DGVAE. The right part presents the learned prototypes and the recommended item.
  • Figure 3: Performance of DGVAE compared with various baselines under different cold-start settings.
  • Figure 4: Comparison of DGVAE with its variants.
  • Figure 5: Comparison of DGVAE under different uni-modal features.
  • ...and 2 more figures