Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation

Mingrui Liu, Sixiao Zhang, Cheng Long

TL;DR

FAME addresses the limitation of single-item embeddings in sequential recommendation by modeling items as multiple facets through a facet-aware multi-head mechanism. Each head predicts next-item scores from facet-specific sub-embeddings, and a Mixture-of-Experts self-attention layer disentangles diverse user preferences within each facet, guided by a learnable router. The per-head predictions are fused with a gating mechanism to form a unified score, enabling dynamic emphasis on different facets during recommendation. Empirical results on Beauty, Sports, Toys, and ML-20m show consistent improvements over strong baselines, with ablations confirming the value of the MoE module and head-level predictions. Overall, FAME provides a modular, scalable enhancement to attention-based sequential recommender systems with improved expressiveness and interpretability.

Abstract

Sequential recommendation (SR) systems excel at capturing users' dynamic preferences by leveraging their interaction histories. Most existing SR systems assign a single embedding vector to each item to represent its features, and various types of models are adopted to combine these item embeddings into a sequence representation vector to capture the user intent. However, we argue that this representation alone is insufficient to capture an item's multi-faceted nature (e.g., movie genres, starring actors). Besides, users often exhibit complex and varied preferences within these facets (e.g., liking both action and musical films in the facet of genre), which are challenging to fully represent. To address the issues above, we propose a novel structure called Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation (FAME). We leverage sub-embeddings from each head in the last multi-head attention layer to predict the next item separately. This approach captures the potential multi-faceted nature of items without increasing model complexity. A gating mechanism integrates recommendations from each head and dynamically determines their importance. Furthermore, we introduce a Mixture-of-Experts (MoE) network in each attention head to disentangle various user preferences within each facet. Each expert within the MoE focuses on a specific preference. A learnable router network is adopted to compute the importance weight for each expert and aggregate them. We conduct extensive experiments on four public sequential recommendation datasets and the results demonstrate the effectiveness of our method over existing baseline models.
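The pipeline described in the abstract (expert outputs mixed by a router inside each head, per-head next-item prediction from sub-embeddings, and a gate that fuses the heads) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`mix_experts`, `fame_scores`), the bias-free linear router and gate, the choice to feed them concatenated vectors, and the fusion of per-head probabilities rather than logits are all hypothetical. The expert outputs are random stand-ins for what the attention experts would produce.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear(weights, x):
    """Bias-free linear layer; `weights` is a list of rows."""
    return [dot(row, x) for row in weights]


def mix_experts(expert_outputs, router_weights):
    """Router-weighted sum of one head's expert outputs (assumed router form)."""
    flat = [v for out in expert_outputs for v in out]
    alpha = softmax(linear(router_weights, flat))  # one weight per expert
    d_head = len(expert_outputs[0])
    return [sum(a * out[k] for a, out in zip(alpha, expert_outputs))
            for k in range(d_head)]


def fame_scores(head_reprs, item_embeddings, gate_weights):
    """Predict next-item distributions per head, then fuse them with a gate."""
    n_heads = len(head_reprs)
    d_head = len(head_reprs[0])
    per_head = []
    for h, f in enumerate(head_reprs):
        # Score every item using only the h-th slice of its embedding.
        logits = [dot(f, emb[h * d_head:(h + 1) * d_head])
                  for emb in item_embeddings]
        per_head.append(softmax(logits))
    flat = [v for f in head_reprs for v in f]
    gate = softmax(linear(gate_weights, flat))  # one weight per head
    n_items = len(item_embeddings)
    return [sum(gate[h] * per_head[h][i] for h in range(n_heads))
            for i in range(n_items)]


# Toy usage with random parameters (all sizes are illustrative).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


n_items, n_heads, n_experts, d_head = 5, 2, 3, 4
items = rand_matrix(n_items, n_heads * d_head)
routers = [rand_matrix(n_experts, n_experts * d_head) for _ in range(n_heads)]
gate_w = rand_matrix(n_heads, n_heads * d_head)
# Stand-ins for the outputs of each head's attention experts.
expert_out = [rand_matrix(n_experts, d_head) for _ in range(n_heads)]

head_reprs = [mix_experts(expert_out[h], routers[h]) for h in range(n_heads)]
scores = fame_scores(head_reprs, items, gate_w)
```

Because the gate weights sum to one and each head emits a probability distribution, the fused `scores` in this sketch also form a distribution over items. No parameters are added to the item embedding table: each head simply reads its own slice of the existing embedding, which is one way to read the abstract's claim that the multi-facet view comes without increasing model complexity.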

Paper Structure

This paper contains 29 sections, 15 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: A motivation example.
  • Figure 2: Overview of the proposed model: (a) illustrates the original Transformer block, while (b) depicts the architecture of our proposed FAME model. For simplicity, the LayerNorm and Dropout operations following the FFN (FFN') are omitted from the figure.
  • Figure 3: MoE Self-Attention Network: Integrated Item Representation Calculation. This diagram visualizes the computational process for determining the integrated item representation of the final item ($f_{t}^{(h)}$) within a specific head ($h$) of our proposed model.
  • Figure 4: An example of the attention score distributions and recommendation results across different experts in a genre-focused head.
  • Figure 5: Performance comparison when varying the number of heads on each dataset. The metric in (a)-(d) is NDCG@20, and the metric in (e)-(h) is HR@20.
  • ...and 2 more figures