Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

TL;DR

This work reveals fundamental scaling limitations of SID-based Generative Recommendation (GR): performance saturates rapidly as the recommender system (RS) model, the LLM encoder, and the tokenizer scale, because semantic IDs (SIDs) are a bottleneck for encoding semantic information. It introduces a mathematically grounded scaling law balancing semantic information and collaborative filtering (CF), and demonstrates that directly using LLMs as recommender systems (LLM-as-RS) yields superior, consistently scalable performance, surpassing SID-based GR by up to about 20% under the same data budget. The findings challenge the notion that LLMs struggle with CF signals, showing that both semantic and CF modeling improve with scale in LLM-as-RS, and they quantify how external CF signals interact with backbone scale. Overall, LLM-as-RS emerges as a promising path toward robust foundation models for generative recommendation, with SID-based GR remaining attractive only under tight efficiency constraints.

Abstract

Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
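To make the notion of a semantic ID concrete, the following is a minimal sketch of how an item embedding could be turned into a short tuple of discrete codes. It assumes residual quantization over several codebooks, which the abstract does not specify (it only says SIDs are "discrete codes quantized from the embeddings"). The codebooks here are random stand-ins for learned ones, and every name and size (`semantic_id`, `sid_capacity_bits`, `DIM`, `NUM_CODEBOOKS`, `CODEBOOK_SIZE`) is hypothetical. The sketch also computes a simple upper bound on how much information such a code can carry, which is one way to read the capacity bottleneck the abstract describes.

```python
import math
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def semantic_id(embedding, codebooks):
    """Residual quantization (an assumed scheme, not confirmed by the source):
    each level quantizes whatever the previous levels left unexplained."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid), residual


def sid_capacity_bits(num_codebooks, codebook_size):
    """Upper bound on the information in one SID: log2(K^L) = L * log2(K)."""
    return num_codebooks * math.log2(codebook_size)


# Hypothetical sizes, chosen only for illustration.
DIM, NUM_CODEBOOKS, CODEBOOK_SIZE = 8, 3, 256

rng = random.Random(0)
# Random codebooks stand in for learned ones.
codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_CODEBOOKS)
]
item_embedding = [rng.gauss(0, 1) for _ in range(DIM)]

sid, residual = semantic_id(item_embedding, codebooks)
print("semantic ID:", sid)
print("capacity upper bound (bits):", sid_capacity_bits(NUM_CODEBOOKS, CODEBOOK_SIZE))
```

Under these assumed sizes, a SID can distinguish at most 256^3 items, i.e. 24 bits, regardless of how large the encoder producing the embedding is. Whatever the encoder captures beyond that ends up in the discarded `residual`.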

Paper Structure

This paper contains 38 sections, 19 equations, 25 figures, 8 tables, 1 algorithm.

Figures (25)

  • Figure 1: Two GR paradigms we investigate in this paper. SID-based GR first transforms the item textual descriptions into semantic IDs and then trains a transformer to predict the SIDs of the next item, while LLM-as-RS directly takes in the texts and outputs the title of the next item.
  • Figure 2: The recommendation performance with varying RS model sizes ($N_{\text{RS}}$). The performance quickly saturates as $N_{\text{RS}}$ scales up to $10^7$ parameters.
  • Figure 3: The recommendation performance with varying LLM encoder sizes ($N_{\text{LLM}}$). Little to no effective scaling behaviors are observed.
  • Figure 4: Lower: Scaling behaviors of quantization tokenizer when varying the number of codebooks. Upper: Comparison of performances between RS modules of 13M and 21M parameters. Overall, increasing the number of codebooks does not overcome the scaling saturation.
  • Figure 5: Lower: Scaling behaviors of the quantization tokenizer when varying the size of each codebook. Upper: Comparison of performances between RS modules of 13M and 21M parameters. Overall, increasing the size of each codebook does not overcome the scaling saturation.
  • ...and 20 more figures