BBQRec: Behavior-Bind Quantization for Multi-Modal Sequential Recommendation

Kaiyuan Li, Rui Xiang, Yong Bai, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai

TL;DR

BBQRec tackles sparsity in multi-modal sequential recommendation by aligning behavior with multi-modal signals through a shared, behavior-dominant quantization codebook and CLUB-based mutual information minimization. It pairs this with a non-invasive, similarity-enhanced self-attention mechanism that discretizes cross-modal similarities to reweight attention without overwriting the core sequence model. Key innovations include a behavior disentanglement module, a residual quantization strategy with an extra learnable ID, and a sequence-to-item contrastive objective that guides representations toward behavioral relevance. Extensive experiments on four real-world datasets show BBQRec consistently outperforms state-of-the-art baselines, validating the effectiveness of both behavior-aligned quantization and non-invasive semantic integration for multi-modal generative sequential recommendation.
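As a rough illustration of the residual quantization idea mentioned above, the following sketch maps a vector to a tuple of semantic IDs by quantizing the remaining residual at each level. It is a generic residual quantizer under stated assumptions, not the authors' implementation: the toy codebooks, the function names `nearest_code` and `residual_quantize`, and the squared-Euclidean lookup are illustrative choices, and the extra learnable ID and the behavior-dominant shared codebook are not modeled.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def residual_quantize(
    x: Sequence[float],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], Vector]:
    """Quantize x level by level; each level encodes what the previous levels missed."""
    residual = list(x)
    ids: List[int] = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        ids.append(k)
        # Subtract the chosen codeword so the next level sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return ids, residual


# Hypothetical two-level codebooks over 2-d vectors (illustration only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
semantic_ids, leftover = residual_quantize([1.2, 0.8], toy_codebooks)
print(semantic_ids, leftover)
```

Each additional level shrinks the leftover error, so the ID tuple becomes a coarse-to-fine description of the item.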

Abstract

Multi-modal sequential recommendation systems leverage auxiliary signals (e.g., text, images) to alleviate data sparsity in user-item interactions. While recent methods exploit large language models to encode modalities into discrete semantic IDs for autoregressive prediction, we identify two critical limitations: (1) Existing approaches adopt fragmented quantization, where modalities are independently mapped to semantic spaces misaligned with behavioral objectives, and (2) Over-reliance on semantic IDs disrupts inter-modal semantic coherence, thereby weakening the expressive power of multi-modal representations for modeling diverse user preferences. To address these challenges, we propose a Behavior-Bind multi-modal Quantization for Sequential Recommendation (BBQRec for short) featuring dual-aligned quantization and semantics-aware sequence modeling. First, our behavior-semantic alignment module disentangles modality-agnostic behavioral patterns from noisy modality-specific features through contrastive codebook learning, ensuring semantic IDs are inherently tied to recommendation tasks. Second, we design a discretized similarity reweighting mechanism that dynamically adjusts self-attention scores using quantized semantic relationships, preserving multi-modal synergies while avoiding invasive modifications to the sequence modeling architecture. Extensive evaluations across four real-world benchmarks demonstrate BBQRec's superiority over the state-of-the-art baselines.
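One way to picture the discretized similarity reweighting described above is as a bias added to the attention logits, looked up from a small table indexed by the bucketized similarity between two items. The sketch below is written under that assumption and is not the paper's formulation: cosine similarity, uniform binning, the additive form, and the names `bucketize`, `bin_bias`, and `reweighted_attention` are all hypothetical, and causal masking is omitted for brevity.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def bucketize(sim: float, num_bins: int) -> int:
    """Map a similarity in [-1, 1] to an integer bin in [0, num_bins - 1]."""
    scaled = (sim + 1.0) / 2.0 * num_bins
    return min(num_bins - 1, max(0, int(scaled)))


def softmax(row: Sequence[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_attention(
    logits: Sequence[Sequence[float]],     # raw self-attention scores, n x n
    item_embs: Sequence[Sequence[float]],  # multi-modal item embeddings, n x d
    bin_bias: Sequence[float],             # one scalar per bin (learnable in a real model)
) -> List[List[float]]:
    """Add a per-bin bias to each logit, then normalize each row with softmax."""
    n = len(logits)
    out: List[List[float]] = []
    for i in range(n):
        row = []
        for j in range(n):
            b = bucketize(cosine(item_embs[i], item_embs[j]), len(bin_bias))
            row.append(logits[i][j] + bin_bias[b])
        out.append(softmax(row))
    return out


# Toy usage: uniform logits, so any difference in attention comes from the bias.
embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attn = reweighted_attention([[0.0] * 3 for _ in range(3)], embs, [0.0, 0.0, 0.0, 1.0])
print(attn[0])
```

Because the bias only shifts the logits, the query, key, and value projections of the underlying sequence model stay untouched, which is the sense in which such a mechanism can be called non-invasive.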

Paper Structure

This paper contains 22 sections, 17 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A simple diagram illustrating the key differences between our proposed method and existing approaches.
  • Figure 2: The overall architecture of Behavior-aligned Multi-modal Quantized framework for Sequential Recommendation (BBQRec for short). TS/IS refers to the text/image-specific encoder, TB/IB denotes the behavior-aligned text/image encoder, and TR/IR represents the text/image decoder.
  • Figure 3: Performance of BBQRec on the Beauty dataset with different hyper-parameters.
  • Figure 4: The results of the three variants of BBQRec, namely $\text{BBQRec}_{\neg\text{ID}}$, $\text{BBQRec}_{\neg\text{text}}$, and $\text{BBQRec}_{\neg\text{image}}$, on four datasets.