Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko
TL;DR
The paper addresses the vocabulary and efficiency challenges posed by large item catalogs in recommender systems by learning variable-length semantic IDs. It proposes a discrete variational autoencoder with Gumbel-Softmax reparameterization that models a per-item, adaptive code length $L$, with tokens drawn from a shared vocabulary and without relying on an EOS symbol. The method combines ideas from emergent communication with modern recommender architectures: it introduces priors over symbols and lengths and a three-term ELBO that balances reconstruction, vocabulary usage, and length, which yields shorter codes for popular items and longer, more expressive codes for long-tail items. Empirically, variable-length semantic IDs preserve reconstruction quality, reduce token usage, and improve downstream recall and coverage, while training more stably than REINFORCE-based approaches and scaling to large datasets. The approach supports efficient, scalable generative retrieval that is better aligned with natural language, with practical implications for integrating semantic IDs into LLM-based pipelines and conversational interfaces.
Abstract
Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.
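To make the Gumbel-Softmax reparameterization mentioned above concrete, here is a minimal sketch of the standard relaxation for a single token position. It is not the paper's implementation: the function name `gumbel_softmax`, the example logits, the temperature value, and the four-symbol vocabulary are all assumptions chosen for illustration, and the sketch does not model code length.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the vocabulary.

    Hypothetical helper for illustration; not taken from the paper.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform sampling.
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.0]  # assumed encoder scores for 4 symbols
soft = gumbel_softmax(logits, temperature=0.5, rng=rng)

# The discrete token is the argmax of the relaxed sample.
hard_token = max(range(len(soft)), key=soft.__getitem__)
```

Lower temperatures push `soft` toward a one-hot vector, while higher temperatures give smoother samples. Because the noise is drawn independently of the logits, an autodiff implementation of the same computation lets gradients flow to the encoder without the score-function (REINFORCE) estimator.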
