Masked Diffusion for Generative Recommendation

Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins

TL;DR

This paper tackles the inefficiencies of autoregressive (AR) generative recommendation with semantic IDs (SIDs) by introducing MaskGR, a discrete masked diffusion model over SID sequences. MaskGR enables parallel decoding of SID tokens, improves data efficiency, and better captures global relationships among items, yielding strong gains over AR and continuous-diffusion baselines with far fewer inference steps. The work also demonstrates that MaskGR can be extended with dense retrieval, achieving further gains and showing compatibility with existing AR enhancements. Overall, MaskGR offers a simple, generalizable framework that improves performance, speed, and flexibility in generative recommendation with semantic IDs, with public code available for reproducibility.

Abstract

Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing work on GR with SIDs models the probability of the sequence of SIDs corresponding to a user's interaction history autoregressively. While this has led to impressive next-item prediction performance in certain settings, these autoregressive GR with SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data, and a bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.

Paper Structure

This paper contains 39 sections, 8 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of training and inference of MaskGR. During training, MaskGR randomly masks each SID in the SID sequence with a probability $t \sim \textrm{Unif}[0, 1]$ and feeds the masked sequence into an encoder-only transformer. The model is then optimized to reconstruct the original values of the masked SIDs using a cross-entropy loss applied at the masked positions (see Eq. \ref{eq:madrec-loss}). During inference, MaskGR begins with all SIDs of the last item replaced by masks. At each inference step, the partially masked sequence is passed through the network to predict values for all masked positions. The model then selectively unmasks a subset of these positions by retaining their predicted values while keeping the remaining positions masked. This iterative process continues until all SIDs are unmasked.
  • Figure 2: Improved performance gap for coarse-grained retrieval on the Beauty and Sports datasets. The gap in Recall@K between TIGER and MaskGR increases as K increases.
  • Figure 3: Comparison of data efficiency of MaskGR and TIGER by dropping 25%, 37.5%, 50%, 67.5% and 75% of items from each sequence in the training set, while maintaining at least three items in each sequence.
  • Figure 4: Next-$k$ item prediction performance vs. number of function evaluations (NFEs) during inference for (Left) $k=1$ on Beauty and (Right) $k=2$ on MovieLens-1M. The AR methods (TIGER and LIGER) must decode tokens sequentially, so they always execute $k \times$ (# SIDs/item) NFEs. MaskGR can decode multiple items in parallel, which allows performance to be traded off against efficiency by tuning the number of NFEs. Moreover, it already outperforms the AR methods with fewer NFEs.
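
The masking and iterative unmasking described in the Figure 1 caption, together with the NFE budget discussed in the Figure 4 caption, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The `MASK` sentinel, the function names, the `toy_predict` stand-in for the encoder-only transformer, and in particular the rule of unmasking the most confident positions first are all assumptions made for illustration; the captions only state that a subset of positions is unmasked at each step.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked SID token (assumption)


def mask_sequence(sids, rng):
    """Training-time masking: draw t ~ Unif[0, 1], then mask each SID
    independently with probability t."""
    t = rng.random()
    masked = [MASK if rng.random() < t else s for s in sids]
    return masked, t


def masked_positions(seq):
    """Positions where a cross-entropy loss would be applied in training,
    and where predictions are needed at inference."""
    return [i for i, s in enumerate(seq) if s == MASK]


def iterative_unmask(seq, predict, num_steps):
    """Unmask all masked positions using at most num_steps model calls.

    predict(seq) must return {position: (token, confidence)} for every
    masked position. Keeping the most confident predictions first is an
    assumed selection rule, not one taken from the paper.
    """
    seq = list(seq)
    nfes = 0  # number of function evaluations (model calls)
    for step in range(num_steps):
        positions = masked_positions(seq)
        if not positions:
            break
        preds = predict(seq)
        nfes += 1
        # Spread the remaining masked positions evenly over remaining steps.
        k = math.ceil(len(positions) / (num_steps - step))
        ranked = sorted(positions, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = preds[i][0]
    return seq, nfes


def toy_predict(seq):
    """Deterministic placeholder for the network (purely illustrative)."""
    return {i: ((i * 7) % 256, 1.0 / (i + 1)) for i in masked_positions(seq)}


if __name__ == "__main__":
    history = [3, 14, 15, 92, 65, 35, 89, 79]  # two items, 4 SIDs each
    query = history + [MASK] * 4               # last item fully masked
    decoded, nfes = iterative_unmask(query, toy_predict, num_steps=2)
    print(decoded[-4:], "decoded in", nfes, "NFEs")
```

In this sketch, an item with 4 SIDs is decoded in 2 model calls, whereas token-by-token autoregressive decoding of the same item would need 4. Setting `num_steps` to the number of masked tokens recovers one-token-per-step decoding, which is the knob the Figure 4 caption refers to when it describes tuning the NFEs.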