MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation

Yi Xu, Moyu Zhang, Chenxuan Li, Zhihao Liao, Haibo Xing, Hao Deng, Jinxin Hu, Yu Zhang, Xiaoyi Zeng, Jing Zhang

TL;DR

MMQ tackles the challenge of scalable item representation in recommender systems by proposing semantic IDs generated from multimodal content. It introduces a two-stage framework: (1) a Multimodal Shared-Specific Tokenizer Training Stage with a multi-expert architecture and orthogonal regularization to capture both cross-modal synergy and modality-specific cues, and (2) a Behavior-aware Fine-tuning Stage that uses differentiable soft indexing to align semantic IDs with downstream behavioral objectives. The approach achieves state-of-the-art performance in both generative retrieval and discriminative ranking on industrial and public datasets, with online A/B tests showing tangible business gains. MMQ demonstrates strong generalization, robustness to long-tail items, and scalability with longer semantic-ID sequences, making it a practical solution for large, dynamic item corpora and cross-domain applications.

Abstract

Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
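The abstract names orthogonal regularization as the mechanism that keeps the modality-specific and modality-shared experts from collapsing onto the same representation, but this summary does not give the loss itself. The following is a minimal sketch of one common form of such a penalty, assuming each expert emits one vector per item and that diversity is measured by squared cosine similarity between distinct experts. The function name `orthogonal_penalty` and the exact normalization are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def orthogonal_penalty(expert_outputs: np.ndarray, eps: float = 1e-8) -> float:
    """Mean squared cosine similarity between distinct experts.

    Hypothetical helper (an assumption, not taken from the paper).
    expert_outputs has shape (num_experts, dim), with num_experts >= 2:
    one output vector per expert for a single item.
    Returns 0.0 when all experts are mutually orthogonal and 1.0 when
    they all point in the same direction.
    """
    # Normalize each expert vector so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(expert_outputs, axis=1, keepdims=True)
    unit = expert_outputs / np.maximum(norms, eps)
    gram = unit @ unit.T

    # Zero the diagonal: an expert's similarity to itself is not penalized.
    off_diag = gram - np.diag(np.diag(gram))

    num_experts = gram.shape[0]
    return float((off_diag ** 2).sum() / (num_experts * (num_experts - 1)))
```

In a training loop, a term of this kind would typically be added to the tokenizer's reconstruction and quantization losses with a small weight, so that experts are pushed apart without overriding the main objective.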

Paper Structure

This paper contains 22 sections, 17 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Illustration of text and vision modal interaction.
  • Figure 2: Our Proposed MMQ Framework. (1) Multimodal Shared-Specific Tokenizer Training Stage: We introduce a multi-expert architecture with modality-specific and modality-shared experts to explicitly model both unique and synergistic information. In addition, we leverage orthogonal regularization to encourage expert diversity, preventing expert collapse and ensuring comprehensive representation. (2) Behavior-Aware Fine-Tuning Stage: We adapt semantic ID clusters dynamically using downstream recommendation objectives, bridging the semantic-behavioral gap.
  • Figure 3: Item Popularity Stratified Performance Comparison.
  • Figure 4: The Compatibility Experiments on Integrating Behavior-Aware Fine-tuning with RQ-VAE.
  • Figure 5: The scalability of the semantic ID length.
  • ...and 1 more figure
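The Figure 2 caption describes a fine-tuning stage in which semantic ID clusters are adapted by downstream recommendation objectives, which requires gradients to pass through the codebook lookup. The summary does not specify how the soft indexing is computed, so the sketch below assumes a standard relaxation: a temperature-scaled softmax over negative squared distances to the codebook entries. The function name `soft_index` and the `temperature` parameter are hypothetical.

```python
import numpy as np


def soft_index(z, codebook, temperature=1.0):
    """Soft assignment of an embedding to codebook entries.

    Hypothetical helper (an assumption, not taken from the paper).
    z: array of shape (dim,), the item embedding to quantize.
    codebook: array of shape (num_codes, dim).
    Returns (weights, soft_code): weights sum to 1 over the codes, and
    soft_code is the weighted average of the codebook entries.
    """
    z = np.asarray(z, dtype=float)
    codebook = np.asarray(codebook, dtype=float)

    # Squared Euclidean distance from z to every codebook entry.
    sq_dist = ((codebook - z) ** 2).sum(axis=1)

    # Softmax over negative distances; subtract the max for numerical stability.
    logits = -sq_dist / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    return weights, weights @ codebook
```

Under this assumption, a lower temperature makes the weights approach the hard nearest-neighbor assignment used at inference time, while a higher temperature spreads weight across several codes so that more codebook entries are updated during fine-tuning.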