MMGRec: Multimodal Generative Recommendation with Transformer Model

Han Liu, Yinwei Wei, Xuemeng Song, Weili Guan, Yuan-Fang Li, Liqiang Nie

TL;DR

MMGRec introduces a novel generative paradigm for multimodal recommendation by defining Rec-ID as a semantically rich token sequence augmented with a popularity token. It pairs Graph RQ-VAE-based Rec-ID assignment, which fuses multimodal content and collaborative signals via a graph neural network and hierarchical vector quantization, with a Transformer-based Rec-ID generation module that autoregressively predicts item IDs. A relation-aware self-attention mechanism enables the model to capture user-specific relationships in non-sequential interaction histories, improving personalization. Experiments on three real-world datasets demonstrate state-of-the-art performance and favorable inference efficiency, validating the benefits of moving from embed-and-retrieve to generation. The work also provides insights into collision handling, Rec-ID design choices, and practical considerations for scalability and future integration with larger models.

Abstract

Multimodal recommendation aims to recommend user-preferred candidates based on the user's historically interacted items and the associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: learning user and item representations in the same embedding space, then retrieving similar candidate items for a user via the embedding inner product. However, this paradigm suffers from issues of inference cost, interaction modeling, and false negatives. To this end, we propose MMGRec, a model that introduces a generative paradigm into multimodal recommendation. Specifically, we first devise a hierarchical quantization method, Graph RQ-VAE, to assign a Rec-ID to each item from its multimodal and collaborative filtering (CF) information. Consisting of a tuple of semantically meaningful tokens, the Rec-ID serves as the unique identifier of each item. Afterward, we train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences. The paradigm is generative because the model predicts, token by token in an autoregressive manner, the tuple that identifies the recommended item. Moreover, a relation-aware self-attention mechanism is devised for the Transformer to handle non-sequential interaction sequences; it exploits pairwise relations between elements in place of absolute positional encoding. Extensive experiments evaluate MMGRec's effectiveness against state-of-the-art methods.
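To make the idea of a Rec-ID as a tuple of tokens more concrete, the following is a minimal sketch of plain residual quantization, the hierarchical quantization step that RQ-VAE-style methods build on. It is not the authors' Graph RQ-VAE: the graph encoder, the training losses, the popularity token, and collision handling are all omitted. The function names (`nearest_code`, `residual_quantize`) and the hand-picked toy codebooks in `CODEBOOKS` are assumptions made for illustration only; in practice the codebooks would be learned.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    item_vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[Tuple[int, ...], List[float]]:
    """Quantize item_vec level by level.

    At each level the nearest code to the current residual is chosen,
    its index becomes one token, and the code is subtracted so the next
    level only has to describe what is still unexplained.
    Returns the token tuple and the final residual.
    """
    residual = list(item_vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens), residual


# Hypothetical toy codebooks (hand-picked, not learned):
# level 1 is coarse, level 2 refines the leftover residual.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]

if __name__ == "__main__":
    tokens, residual = residual_quantize([1.4, 0.6], CODEBOOKS)
    print(tokens)  # (1, 1)
```

In this toy setting, items whose embeddings are close share their leading token and differ only in later ones, which is the sense in which the token tuple is hierarchical. Two items can also map to the same tuple, which is why a real system needs some form of collision handling.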

Paper Structure (30 sections, 19 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: An illustration of Graph RQ-VAE model architecture.
  • Figure 2: Schematic illustration of Transformer encoder-decoder setup for building our generative recommendation model.
  • Figure 3: An illustration of relation-aware self-attention and its corresponding multi-head attention. Therein, $\mathbf{Q}=\mathbf{E}\mathbf{W}^Q$, $\mathbf{Q}_u=\mathbf{E}\mathbf{W}^Q_u$, and the remaining parameters are computed analogously.
  • Figure 4: Effect of layer and head numbers.
  • Figure 5: Inference time on different-scale Kwai dataset. The unit of time is milliseconds (ms).
  • ...and 1 more figure