Language Models As Semantic Indexers

Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang

TL;DR

LMIndexer is a self-supervised framework that learns semantic IDs directly from text using a generative language model. It models IDs as sequential discrete representations, learned with progressive training and contrastive learning, and it compensates for the lack of semantic supervision with a document reconstruction objective, so that document semantics and hierarchical structure are captured jointly. On three downstream tasks (sequential recommendation, product search, and document retrieval), LMIndexer outperforms competitive baselines on five datasets from various domains. The learned indexer can be fine-tuned for each downstream task, which points to the potential of neural semantic IDs for scalable information access.

Abstract

A semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs: they first obtain embeddings using off-the-shelf text encoders and then derive IDs from those embeddings. However, each step introduces potential information loss, and there is usually an inherent mismatch between the distribution of embeddings in the latent space produced by text encoders and the distribution required for semantic indexing. It is non-trivial to design a method that learns a document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured and that semantic supervision is deficient. In this paper, we introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete IDs by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks (recommendation, product search, and document retrieval) on five datasets from various domains. Code is available at https://github.com/PeterGriffinJin/LMIndexer.
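To make "discrete and sequentially structured" concrete, the following minimal sketch shows how sequential IDs induce a hierarchy through shared prefixes. The document names and code values are made up for illustration, and `group_by_prefix` is a hypothetical helper, not part of the LMIndexer code base.

```python
from collections import defaultdict

# Hypothetical semantic IDs: each document maps to a short sequence of
# discrete codes. All names and values here are illustrative only.
semantic_ids = {
    "doc_a": (3, 1, 4),
    "doc_b": (3, 1, 7),
    "doc_c": (3, 5, 2),
    "doc_d": (8, 0, 6),
}


def group_by_prefix(ids, depth):
    """Group document names by the first `depth` codes of their ID."""
    groups = defaultdict(list)
    for name, code in ids.items():
        groups[code[:depth]].append(name)
    # Sort names so the result does not depend on insertion order.
    return {prefix: sorted(names) for prefix, names in groups.items()}


# Depth 1 gives coarse clusters; depth 2 refines them.
coarse = group_by_prefix(semantic_ids, 1)
fine = group_by_prefix(semantic_ids, 2)
print(coarse)
print(fine)
```

In this toy data, three documents share the first code and two of them also share the second, so a longer shared prefix corresponds to a finer-grained cluster. This is the kind of hierarchical structure a semantic ID is meant to encode.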

Paper Structure

This paper contains 29 sections, 13 equations, 8 figures, 15 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Overview of the LMIndexer self-supervised ID learning framework. The proposed semantic indexer consists of a semantic ID encoder and several codebooks. During self-supervised learning, a reconstructor rebuilds the input document from the semantic ID representations.
  • Figure 2: LMIndexer can be fine-tuned on downstream tasks including recommendation (user history as input and item ID as output) and retrieval (query as input and document ID as output).
  • Figure 3: Semantic ID qualitative study on Amazon-Beauty.
  • Figure 4: Semantic indexer training analysis on Amazon-Sports. The x-axis denotes the training step and the y-axis denotes the evaluation metric. (a) Reconstructor collapse analysis: without reconstructor warm-up (blue), the reconstructor suffers from low reconstruction Macro-F1. (b, c) Posterior collapse analysis: without encoder and codebook warm-up (blue), the semantic indexer generates homogeneous IDs (low perplexity), which results in low reconstruction Macro-F1. (d, e) Contrastive learning analysis: without the contrastive objective (blue), documents sharing a prefix ID tend to have similar next-position IDs (low diff ratio) and low diversity (low perplexity).
  • Figure 5: Semantic ID length & codebook size study on Amazon.
  • ...and 3 more figures
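
The Figure 4 caption uses perplexity as a signal for homogeneous IDs. The caption does not define the metric, so the sketch below assumes one common definition: the exponential of the Shannon entropy of the empirical code distribution at a given ID position. The function name and the sample code lists are hypothetical.

```python
import math
from collections import Counter


def code_perplexity(codes):
    """Perplexity (exp of Shannon entropy) of an empirical code distribution.

    Assumed definition: a value near 1 means almost every document received
    the same code; a value near the number of distinct codes means usage is
    close to uniform.
    """
    counts = Counter(codes)
    total = len(codes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Illustrative code assignments at one ID position for eight documents.
collapsed = [0, 0, 0, 0, 0, 0, 0, 0]  # every document gets the same code
diverse = [0, 1, 2, 3, 0, 1, 2, 3]    # four codes used equally often

print(code_perplexity(collapsed))
print(code_perplexity(diverse))
```

Under this definition the collapsed assignment has perplexity 1 and the uniform four-code assignment has perplexity of about 4, which matches the caption's reading of low perplexity as a sign of homogeneous IDs.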