Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation

Haofeng Huang, Ling Gai

TL;DR

This work addresses a limitation of sequential recommender systems: discrete item IDs carry no semantic meaning. It introduces Q-Bert4Rec, a three-stage framework that learns semantic IDs from multimodal content. A dynamic cross-modal fusion module injects textual, visual, and structural signals into item embeddings, which a residual vector quantization module (in the style of RQ-VAE) then discretizes into a shared semantic vocabulary. A multi-mask pretraining strategy (span, tail, and multi-region masking) further strengthens temporal reasoning and supports cross-domain transfer on Amazon benchmarks. On those benchmarks the approach outperforms many strong baselines, and it offers interpretable, scalable semantic representations for multimodal sequential recommendation.
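To make the quantization step concrete, here is a minimal sketch of residual vector quantization in plain Python. It is not the authors' implementation: the codebooks are hand-written toy values rather than learned ones, and the names `nearest`, `residual_quantize`, `codebooks`, and `item_embedding` are hypothetical. Each level picks the nearest code for whatever the previous levels failed to explain, so the list of chosen indices acts as a coarse-to-fine semantic ID.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the residual left by the previous one."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        chosen = codebook[idx]
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, chosen)]        # running reconstruction
        residual = [r - c for r, c in zip(residual, chosen)]  # what is still unexplained
    return codes, recon


# Assumption: two levels with two 2-dimensional codes each (toy values, not learned).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],   # coarse level
    [[0.1, 0.0], [0.0, 0.1]],   # fine level
]
item_embedding = [0.9, 0.1]     # stands in for a fused multimodal item embedding
semantic_id, approx = residual_quantize(item_embedding, codebooks)
print("semantic ID:", semantic_id)
print("reconstruction:", approx)
```

In a trained model the codebooks would be learned jointly with a reconstruction objective; the sketch only shows how a continuous vector becomes a short sequence of discrete tokens.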

Abstract

Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies -- span, tail, and multi-region -- to improve sequential understanding. We validate our model on public Amazon benchmarks and demonstrate that Q-Bert4Rec significantly outperforms many strong existing methods, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be publicly available on GitHub after publishing.
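The three masking strategies named in the abstract can be illustrated with a small sketch. The abstract does not define them precisely, so the behavior below is an assumption: span masking hides one contiguous block, tail masking hides the last few positions (resembling next-item prediction), and multi-region masking hides several short blocks. The `MASK` token, the function names, and all parameters are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def span_mask(seq, span_len, rng):
    """Mask one contiguous span starting at a random position."""
    span_len = min(span_len, len(seq))
    start = rng.randrange(len(seq) - span_len + 1)
    return [MASK if start <= i < start + span_len else tok
            for i, tok in enumerate(seq)]


def tail_mask(seq, tail_len):
    """Mask the last tail_len positions of the sequence."""
    tail_len = min(tail_len, len(seq))
    cut = len(seq) - tail_len
    return list(seq[:cut]) + [MASK] * tail_len


def multi_region_mask(seq, num_regions, region_len, rng):
    """Mask several short regions; in this simple version regions may overlap."""
    out = list(seq)
    region_len = min(region_len, len(seq))
    for _ in range(num_regions):
        start = rng.randrange(len(seq) - region_len + 1)
        for i in range(start, start + region_len):
            out[i] = MASK
    return out


# Toy interaction history: each entry stands in for one item's semantic-ID token.
rng = random.Random(0)
history = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(span_mask(history, 3, rng))
print(tail_mask(history, 2))
print(multi_region_mask(history, 2, 2, rng))
```

During pretraining a model would be asked to recover the hidden tokens; mixing the three patterns exposes it to both local gaps and end-of-sequence prediction.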

Paper Structure

This paper contains 19 sections, 11 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overall framework of the proposed Semantic-ID Quantization. The model fuses multi-modal inputs through a fusion module and maps them into a shared quantized vocabulary space, forming discrete token sequences that serve as a compact and interpretable quantitative language.
  • Figure 2: An overview of Q-Bert4Rec. Q-Bert4Rec consists of three main stages: Dynamic Cross-Modal Semantic Injection, Semantic Quantization, and Multi-Mask Pretraining and Fine-Tuning.
  • Figure 3: Analysis of the number of transformer layers
  • Figure 4: Analysis of the impact of different dropout rates
  • Figure 5: Analysis of the impact of different mask probabilities
  • ...and 2 more figures