CAT-ID$^2$: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce

Xiaoyu Liu, Fuwei Zhang, Yiqing Wu, Xinyu Jia, Zenghua Xia, Fuzhen Zhuang, Zhao Zhang, Fei Jiang, Wei Lin

TL;DR

CAT-ID$^2$ addresses the challenge of learning powerful document IDs (DocIDs) for generative retrieval (GR) in e-commerce by embedding hierarchical category information into DocID learning. It adds three losses to the reconstruction loss $\mathcal{L}_{\text{rq}}$ of a residual quantization (RQ-VAE) framework: a Hierarchical Class Constraint Loss $\mathcal{L}_{\text{HCC}}$, a Cluster Scale Constraint Loss $\mathcal{L}_{\text{CSC}}$, and a Dispersion Loss $\mathcal{L}_{\text{Dis}}$. The learned IDs are then refined with Sinkhorn post-processing and used as target sequences for downstream LLM training. Offline experiments on ESCI and online A/B tests show state-of-the-art GR recall and online gains for ambiguous-intent and long-tail queries, supporting the claim that category hierarchies improve ID quality and downstream retrieval. The approach balances global semantic coherence with category-aware discrimination, and the combination of quantization-based DocID learning, category-informed constraints, and end-to-end LLM fine-tuning offers a practical path to deploying GR in real-world e-commerce search.

Abstract

Generative retrieval (GR) has gained significant attention as an effective paradigm that integrates the capabilities of large language models (LLMs). It generally consists of two stages: constructing discrete semantic identifiers (IDs) for documents and retrieving documents by autoregressively generating ID tokens. The core challenge in GR is how to construct document IDs (DocIDs) with strong representational power. Good IDs should exhibit two key properties: similar documents should have more similar IDs, and each document should maintain a distinct and unique ID. However, most existing methods ignore native category information, which is common and critical in e-commerce. Therefore, we propose a novel ID learning method, CAtegory-Tree Integrated Document IDentifier (CAT-ID$^2$), incorporating prior category information into the semantic IDs. CAT-ID$^2$ includes three key modules: a Hierarchical Class Constraint Loss to integrate category information layer by layer during quantization, a Cluster Scale Constraint Loss for uniform ID token distribution, and a Dispersion Loss to improve the distinction of reconstructed documents. These components enable CAT-ID$^2$ to generate IDs that make similar documents more alike while preserving the uniqueness of different documents' representations. Extensive offline and online experiments confirm the effectiveness of our method, with online A/B tests showing a 0.33% increase in average orders per thousand users for ambiguous-intent queries and 0.24% for long-tail queries.
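
To make the "semantic ID" idea concrete, the following is a minimal sketch of greedy residual quantization, the mechanism that turns a document embedding into a short sequence of ID tokens. It is an illustration under stated assumptions, not the authors' implementation: the function name `residual_quantize`, the random codebooks, and the sizes (3 levels, 8 codes, 4 dimensions) are all hypothetical. The encoder and decoder of an RQ-VAE, the three category-aware losses, and the Sinkhorn post-processing are omitted.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization.

    At each level, pick the codeword nearest to the current residual,
    record its index as one ID token, and subtract it from the residual.
    Returns the list of token indices and the reconstructed vector.
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    ids = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        ids.append(k)
        recon += codebook[k]
        residual -= codebook[k]
    return ids, recon

# Hypothetical setup: random codebooks stand in for learned ones.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) * (0.5 ** level) for level in range(3)]
doc_embedding = rng.normal(size=4)

doc_id, recon = residual_quantize(doc_embedding, codebooks)
print("DocID tokens:", doc_id)
print("reconstruction error:", float(np.sum((doc_embedding - recon) ** 2)))
```

In this sketch, two documents whose embeddings are close tend to share their first tokens and differ only in later ones, which is the "similar documents, similar IDs" property described above. A plain quantizer of this kind does not guarantee unique IDs or balanced codebook usage, and those are the gaps the abstract's additional losses are meant to address.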

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: An example of Category-Tree in E-commerce.
  • Figure 2: Overall framework of CAT-ID$^2$. It comprises two stages: DocID Learning and Generative Model Training. In the DocID Learning stage, the Hierarchical Class Constraint Loss $\mathcal{L}_\text{HCC}$ (HCCL), Cluster Scale Constraint Loss $\mathcal{L}_\text{CSC}$ (CSCL), and Dispersion Loss $\mathcal{L}_\text{Dis}$ are introduced alongside the original residual quantization loss $\mathcal{L}_\text{rq}$. HCCL integrates prior category information, CSCL ensures uniform codebook utilization to prevent collapse, and the Dispersion Loss enforces distinct semantic IDs for different documents.
  • Figure 3: Layer distribution of IDs across the top 10 categories. Larger nodes indicate a greater number of samples within the corresponding codebook.
  • Figure 4: Visualization of different documents under four categories: 1) Toys & Games. 2) Clothing, Shoes & Jewelry. 3) Arts, Crafts & Sewing. 4) Cell Phones & Accessories.
  • Figure 5: The impact of different loss weights ($\alpha$, $\beta$, $\gamma$), codebook size, and category depth on model performance.
  • ...and 1 more figure