CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Yijie Li, Jianheng Tang, Yunhuai Liu, Edith C. H. Ngai

TL;DR

CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module that learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. Experiments on four public datasets show that it consistently outperforms state-of-the-art baselines in multimodal sequential recommendation.

Abstract

The explosion of multimedia data in information-rich environments has intensified the challenges of personalized content discovery, positioning recommendation systems as an essential form of passive data management. Multimodal sequential recommendation, which leverages diverse item information such as text and images, has shown great promise in enriching item representations and deepening the understanding of user interests. However, most existing models rely on heuristic fusion strategies that fail to capture the dynamic and context-sensitive nature of user-modal interactions. In real-world scenarios, user preferences for modalities vary not only across individuals but also within the same user across different items or categories. Moreover, the synergistic effects between modalities, where combined signals trigger user interest in ways isolated modalities cannot, remain largely underexplored. To this end, we propose CAMMSR, a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module, which learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. This component dynamically allocates modality weights guided by an auxiliary category prediction task, enabling adaptive fusion of multimodal signals. Additionally, we design a modality swap contrastive learning task to enhance cross-modal representation alignment through sequence-level augmentation. Extensive experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines, validating its effectiveness in achieving adaptive, synergistic, and user-centric multimodal sequential recommendation.
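
To make the idea of dynamically allocated modality weights concrete, the following is a minimal sketch of attention-style gating over expert outputs. It is not the CAMoE module itself: the abstract does not specify the expert set, how the category signal enters the gate, or the scoring function. The three experts (text, image, synergy), the `category_query` vector, the scaled dot-product scoring, and the function name `fuse_experts` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_experts(expert_outputs, category_query):
    """Weight each expert output by its scaled dot-product score against
    a category-derived query, then return the weighted sum.

    Assumption: the query would come from a category-aware representation;
    here it is simply passed in as a plain vector.
    """
    dim = len(category_query)
    scores = [dot(category_query, out) / math.sqrt(dim) for out in expert_outputs]
    weights = softmax(scores)
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


# Hypothetical 4-dimensional expert outputs for a single item.
text_expert = [0.9, 0.1, 0.0, 0.2]
image_expert = [0.1, 0.8, 0.3, 0.0]
synergy_expert = [0.5, 0.5, 0.4, 0.1]
category_query = [1.0, 0.0, 0.0, 0.5]

fused, weights = fuse_experts(
    [text_expert, image_expert, synergy_expert], category_query
)
print("modality weights:", [round(w, 3) for w in weights])
print("fused representation:", [round(v, 3) for v in fused])
```

With this toy query, the text expert receives the largest weight and the image expert the smallest. In a trained model the weights would instead be learned, and would change from item to item.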

Paper Structure (29 sections, 20 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: The overall framework of CAMMSR. The left-hand side provides a description of the procedure, while the right-hand side details each component. 1) Item Representation Initialization initializes item multimodal representations using a pre-trained extractor with positional sequences. 2) Category-Guided Attentive Mixture of Experts leverages an additional category prediction task and attention mechanisms to effectively allocate modality weights. 3) Modality Swap Contrastive Learning constructs augmented contrastive views through modality swap operations and applies contrastive learning to enhance the alignment between different modality representations. 4) User Interest Learning models user behavior sequences with a Transformer-based encoder and explores DyT as an alternative to LayerNorm in Transformers.
  • Figure 2: Performance comparison for CAMMSR and all variants across all four datasets.
  • Figure 3: A case study from the Beauty dataset showing a purchase sequence and the prediction results of CAMMSR and IISAN.
  • Figure 4: Performance comparison of removing each modality.
  • Figure 5: Efficiency study on both Games and Beauty datasets.
  • ...and 1 more figure
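
The Figure 1 caption mentions DyT as an alternative to LayerNorm. As a rough illustration, the sketch below compares the two on a single feature vector, using the usual Dynamic Tanh form, an element-wise `gamma * tanh(alpha * x) + beta` with a learnable scalar `alpha`. How CAMMSR configures or initializes DyT is not stated in this excerpt, so the parameter values and function names here are illustrative assumptions.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over one feature vector: normalize by the
    vector's own mean and variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [
        g * (v - mean) / math.sqrt(var + eps) + b
        for v, g, b in zip(x, gamma, beta)
    ]


def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: squash each element with tanh(alpha * x), then scale
    and shift. No mean or variance statistics are computed."""
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


# Illustrative input with one large outlier activation.
x = [0.0, 2.0, -2.0, 100.0]
gamma = [1.0] * len(x)
beta = [0.0] * len(x)

ln_out = layer_norm(x, gamma, beta)
dyt_out = dyt(x, 0.5, gamma, beta)  # alpha = 0.5 is an illustrative value

print("LayerNorm:", [round(v, 4) for v in ln_out])
print("DyT:      ", [round(v, 4) for v in dyt_out])
```

The difference visible here is that LayerNorm couples every element through the shared mean and variance, so the outlier shifts all outputs, whereas DyT treats each element independently and bounds it by the tanh.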