Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
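
The mean-initialization baseline described above is straightforward to reproduce. Below is a minimal sketch, assuming a Hugging Face-style causal language model; `model`, `tokenizer`, and the SID token strings are hypothetical placeholders rather than the paper's actual code. It adds the new tokens, sets every new embedding row to the mean of the pretrained rows, and then runs two simple diagnostics of the kind referenced in the abstract: an entropy-based effective-rank estimate and pairwise cosine similarities among the new tokens.

```python
import torch

# Assumptions: `model` and `tokenizer` are a Hugging Face-style causal LM and its
# tokenizer; the SID token strings below are hypothetical placeholders.
new_tokens = [f"<sid_{level}_{code}>" for level in range(3) for code in range(256)]
tokenizer.add_tokens(new_tokens)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight       # (|V| + |V_SID|, D)
    mean_vec = emb[:old_vocab_size].mean(dim=0)     # mean of pretrained rows
    emb[old_vocab_size:] = mean_vec                 # every new token -> the same point

# Spectral diagnostic: with mean initialization the new-token block is rank 1.
# Effective rank here is exp(entropy of normalized singular values), one common definition.
sid_emb = emb[old_vocab_size:].detach().float()
s = torch.linalg.svdvals(sid_emb)
p = s / s.sum()
eff_rank = torch.exp(-(p * torch.log(p.clamp_min(1e-12))).sum())

# Geometric diagnostic: pairwise cosine similarities among SID tokens are all 1.0
# under mean initialization, i.e. the tokens are indistinguishable before fine-tuning.
normed = torch.nn.functional.normalize(sid_emb, dim=-1)
cos_sim = normed @ normed.T
print(f"effective rank: {eff_rank.item():.2f}, mean pairwise cosine: {cos_sim.mean().item():.3f}")
```

Under this initialization the effective rank of the new-token block is 1 and every pairwise cosine similarity is 1.0, which is exactly the collapse the paper diagnoses: fine-tuning must then recover all inter-token structure from scratch.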


Paper Structure

This paper contains 40 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of the GTI grounding stage. The LM backbone and original vocabulary embeddings are frozen (snowflake); only the newly introduced Semantic-ID (SID) token embeddings ($|\mathcal{V}_{\mathrm{SID}}|\!\times\! D$ parameters, fire) are trained. Paired prompts map between natural language descriptions and SID tokens in both directions, grounding the new tokens in the pretrained embedding space. This stage is inserted before standard end-to-end fine-tuning (see Section \ref{sec:method}).
  • Figure 2: Token-embedding collapse under mean initialization and the effect of grounding. (a) Left: Mean initialization maps all SID tokens (white triangles) to a single point, collapsing inter-token distinctions. Top-right: GTI grounds SID tokens (colored triangles) into distinct regions by training only the $|\mathcal{V}_{\mathrm{SID}}|\!\times\! D$ embedding parameters while freezing the backbone. Bottom-right: Fine-tuning without grounding does not fully resolve the collapse (see Figure \ref{fig:svd_rsa}). (b) & (c) GTI initialization yields higher effective rank and preserves blockwise hierarchical structure among SID tokens after downstream-task supervised fine-tuning.
  • Figure 3: Relative gain versus candidate pool size. Left/Middle: Relative Precision@K gain under Good Match and Good & Maybe Match; Right: Relative NDCG@K gain (Composite). GTI consistently outperforms both baselines across all pool sizes, with the largest gains at small $K$. Shaded areas denote variability across runs.
  • Figure 4: Relative gain versus candidate pool size. Left: Relative Recall@K gain; Right: Relative NDCG@K gain. Shaded areas denote variability across runs.
  • Figure 5: Pairwise cosine-similarity matrices under three initialization strategies. Each matrix shows similarities between 50 pretrained tokens (upper-left block) and 50 SID tokens (bottom-right block). Random initialization (left) yields noninformative SID embeddings. Mean initialization (middle) collapses SID tokens into a near-uniform block. GTI (right) produces differentiated intra-SID structure with meaningful affinities to pretrained tokens.
  • ...and 3 more figures
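
Figures 1 and 2 describe the grounding stage concretely enough to sketch: the backbone and pretrained vocabulary rows stay frozen, only the new SID rows receive gradient, and supervision consists of paired prompts mapping item descriptions to SID tokens and back. The snippet below is a minimal illustration under the same assumptions as the earlier sketch (`model`, `tokenizer`, `old_vocab_size`); `paired_prompts`, the prompt format, and the optimizer settings are hypothetical stand-ins for whatever templates and schedules the paper actually uses.

```python
import torch

# Freeze the backbone and the pretrained vocabulary; only the |V_SID| x D block
# of new embedding rows will be updated (see Figure 1).
for p in model.parameters():
    p.requires_grad_(False)
embedding = model.get_input_embeddings()
embedding.weight.requires_grad_(True)

# weight_decay=0 so AdamW does not shrink the gradient-masked pretrained rows.
optimizer = torch.optim.AdamW([embedding.weight], lr=1e-3, weight_decay=0.0)

model.train()
for description, sid_string in paired_prompts:  # hypothetical iterable of text pairs
    # Paired supervision in both directions: description -> SID and SID -> description.
    for prompt, target in [(description, sid_string), (sid_string, description)]:
        batch = tokenizer(prompt + " " + target, return_tensors="pt")
        # For simplicity the loss covers the whole sequence; in practice one would
        # mask the prompt tokens with -100 so only the target is scored.
        out = model(**batch, labels=batch["input_ids"])
        optimizer.zero_grad()
        out.loss.backward()
        # Zero gradients on pretrained rows so only the new SID embeddings move.
        embedding.weight.grad[:old_vocab_size] = 0
        optimizer.step()
```

Only the new SID rows change during this stage; the backbone and pretrained vocabulary are left intact for the subsequent end-to-end fine-tuning, which is the setup the comparisons in Figures 3-5 evaluate.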