On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification

Jatin Prakash, Anirudh Buvanesh, Bishal Santra, Deepak Saini, Sachin Yadav, Jian Jiao, Yashoteja Prabhu, Amit Sharma, Manik Varma

TL;DR

This paper formalizes how systematic missing labels in extreme classification create irrecoverable knowledge gaps that biased training data cannot mend. It introduces SKIM, a scalable approach that distills task-specific knowledge from unstructured metadata into a small language model, generating diverse synthetic queries per document and mapping them to real training queries to augment the dataset. Through theoretical results and extensive experiments on public XC datasets and a large sponsored-search dataset, SKIM consistently improves Recall@K and business metrics over strong baselines, while remaining computationally efficient via SLM distillation and ANNS-based query mapping. The work demonstrates the practical value of infusing external world knowledge at scale for retrieval tasks and provides open-source resources to facilitate adoption and further research.

Abstract

Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods such as propensity weighting and data imputation strategies that rely solely on the training dataset. While LLMs provide an attractive solution to augment the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose SKIM (Scalable Knowledge Infusion for Missing Labels), an algorithm that combines a small language model (SLM) with abundant unstructured metadata to effectively mitigate the missing label problem. We show the efficacy of our method on large-scale public datasets through exhaustive unbiased evaluation ranging from human annotations to simulations inspired by industrial settings. SKIM outperforms existing methods on Recall@100 by more than 10 absolute points. Additionally, SKIM scales to proprietary query-ad retrieval datasets containing 10 million documents, outperforming contemporary methods by 12% in offline evaluation and increasing ad click-yield by 1.23% in an online A/B test conducted on a popular search engine. We release our code, prompts, trained XC models, and finetuned SLMs at: https://github.com/bicycleman15/skim
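
The headline metric above is Recall@K. As a rough illustration only, the sketch below assumes the common per-query definition (the fraction of a query's relevant documents that appear in the top K retrieved documents); the paper's exact evaluation protocol is not reproduced here, and the function name and toy document ids are hypothetical.

```python
def recall_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the relevant documents found in the top-k ranked documents.

    Assumed definition: |top-k ∩ relevant| / |relevant| for a single query.
    """
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant docs, recall 0
    top_k = set(ranked_docs[:k])
    return len(top_k & relevant) / len(relevant)


# Toy example: one query, a ranked list from some retriever, and its true labels.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

r_at_2 = recall_at_k(ranked, relevant, k=2)  # only d1 is in the top 2
r_at_4 = recall_at_k(ranked, relevant, k=4)  # d1 and d2 are in the top 4
```

In practice the per-query values would be averaged over all test queries. Note that a relevant document missing from the ground truth (a missing label) cannot be counted, which is why the abstract stresses unbiased evaluation.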

Paper Structure (35 sections, 6 theorems, 7 figures, 10 tables, 1 algorithm)

This paper contains 35 sections, 6 theorems, 7 figures, 10 tables, 1 algorithm.

Key Result

lemma 1

If $y_{il}=0 \quad \forall (\mathbf{x}_i,\mathbf{z}_l) \in D_m$, then for any test pair $(\mathbf{x},\mathbf{z}) \sim D_m$, $R(\mathbf{x},\mathbf{z},k_m) \perp\!\!\!\perp \mathcal{D}$, where $\mathcal{D} = \{\{\mathbf{x}_i,\mathbf{y}_i\}_{i=1}^{N},\{\mathbf{z}_l\}_{l=1}^{L}\}$ is the training data.
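
The intuition behind this lemma can be illustrated with a small simulation. The sketch below is not the paper's proof or experimental setup: it assumes a toy click model in which a relevant pair is logged with some propensity, and treats the pairs with zero propensity as the systematically missing region $D_m$. All function and variable names are hypothetical. Two "worlds" that differ only in their true relevance inside that region produce identical click logs, so nothing computed from the logs, including an inverse-propensity estimate, can tell them apart.

```python
import random


def observe_clicks(relevance, propensity, seed=0):
    """Simulate a click log: a relevant pair is logged with probability `propensity`."""
    rng = random.Random(seed)
    clicks = []
    for rel_row, prop_row in zip(relevance, propensity):
        row = []
        for rel, p in zip(rel_row, prop_row):
            exposed = rng.random() < p  # never true when p == 0.0
            row.append(rel if exposed else 0)
        clicks.append(row)
    return clicks


def ips_estimate(clicks, propensity):
    """Inverse-propensity estimate of the total number of relevant pairs."""
    total = 0.0
    for click_row, prop_row in zip(clicks, propensity):
        for c, p in zip(click_row, prop_row):
            if p > 0:  # pairs with zero propensity cannot be reweighted
                total += c / p
    return total


n_queries, n_docs = 4, 6
# The last two documents are never exposed: they stand in for the region D_m.
propensity = [[0.5] * 4 + [0.0] * 2 for _ in range(n_queries)]

# World A: only the first two documents are relevant to every query.
world_a = [[1, 1, 0, 0, 0, 0] for _ in range(n_queries)]
# World B: same as A, plus the never-exposed documents are also relevant.
world_b = [[1, 1, 0, 0, 1, 1] for _ in range(n_queries)]

clicks_a = observe_clicks(world_a, propensity)
clicks_b = observe_clicks(world_b, propensity)

same_logs = clicks_a == clicks_b
same_ips = ips_estimate(clicks_a, propensity) == ips_estimate(clicks_b, propensity)
```

Here `same_logs` and `same_ips` are both true even though the two worlds disagree on eight relevant pairs, which is the sense in which knowledge about $D_m$ has to come from outside the training data.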

Figures (7)

  • Figure 1: Connections between queries and documents define the knowledge available in a retrieval dataset. In the above example, the document concept exome sequencing is connected to the query "exon" through a user click, encoding the knowledge that the concepts exon and exome sequencing are related. The document concept exon is not directly connected to the query "exome", but this relationship can possibly be learnt (see Table \ref{tab:label_concepts} in the appendix). However, the relevance of exon to genes or RNA is impossible to learn from this dataset because there are no connecting clicks providing this knowledge, that is, those connections (documents) are systematically missing for exon. (Note: implies relevant & clicked, implies relevant & missing)
  • Figure 2: Steps of the SKIM algorithm depicting how we bridge the missing knowledge in biased training datasets. In Step 1, for a URL (document) "https://www.encyclopedia.com/science-and-technology/biology-and-genetics/genetics-and-genetic-engineering/rna-processing", the finetuned SLM generates diverse synthetic queries spanning concepts like protein synthesis, exons, poly-A tail, etc., using the available unstructured metadata (see Figure \ref{fig:slm_rephrasing_metadata_figure} in the appendix). In Step 2, a retriever is used to increase the coverage of the chosen document to relevant train queries through these synthetic queries, e.g., the synthetic query "what are exons" is mapped to similar train queries like "define exon" and "exon", which were missing for the document. The retriever additionally filters out irrelevant train queries for the document using the similarity threshold $\tau$.
  • Figure 3: The X-axis shows an increasing fraction of relevant pairs (clicks); the value in brackets is the top-K used in the simulation to collect that fraction of clicks. We compare Renée models trained with (a) only click data (MNAR), (b) click data + IPS with golden propensities, (c) click data + SKIM, and (d) click data created by sampling relevant pairs uniformly at random (MAR). All four settings are compared across different fractions of relevant pairs being exposed.
  • Figure 4: We use named entities as a proxy for knowledge to show that XC applications require vast and long-tail knowledge. On the X-axis we plot the index of the named entities, and on the Y-axis we show the normalized cumulative frequency of the named entities. In the LF-Orcas-800K dataset \cite{Dahiya23bDEXA}, where the task is matching user queries to web-page URLs, around 80% of the knowledge is covered by 2.23% of the entities. In the LF-WikiTitlesHierarchy-2M dataset, where the task is to match Wikipedia titles to categories, around 11.28% of the entities cover the same fraction of knowledge. This highlights the extensive and long-tail knowledge required by XC applications.
  • Figure 5: Intuitive visual explanation of how a finetuned SLM uses implicit knowledge contained in unstructured metadata to generate diverse synthetic queries that are representative of different knowledge concepts that could be absent in the training dataset. In other words, the finetuned SLM is able to skim through the unstructured metadata, ignoring the non-relevant text, and generate only the relevant synthetic queries that are representative of different knowledge concepts about the document.
  • ...and 2 more figures
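
Step 2 in the Figure 2 caption (mapping synthetic queries to similar train queries and filtering with the threshold $\tau$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the embeddings are hand-written toy vectors standing in for a real retriever's encoder, the search is brute-force cosine similarity rather than ANNS, and the function names, the document id, and the value `tau=0.8` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def map_to_train_queries(doc_id, synthetic_vecs, train_vecs, tau):
    """Return new (train_query, doc_id) pairs for train queries that are
    at least `tau`-similar to any synthetic query generated for `doc_id`."""
    new_pairs = set()
    for syn_vec in synthetic_vecs.values():
        for train_query, train_vec in train_vecs.items():
            if cosine(syn_vec, train_vec) >= tau:  # threshold filters irrelevant queries
                new_pairs.add((train_query, doc_id))
    return new_pairs


# Toy embeddings (assumed): a real system would encode the query text.
train_vecs = {
    "define exon": [0.9, 0.1, 0.0],
    "exon": [1.0, 0.0, 0.0],
    "cheap flights": [0.0, 0.0, 1.0],
}
synthetic_vecs = {"what are exons": [0.95, 0.05, 0.0]}

augmented = map_to_train_queries("rna-processing", synthetic_vecs, train_vecs, tau=0.8)
```

With these toy vectors, "what are exons" is close to "define exon" and "exon" but orthogonal to "cheap flights", so only the first two train queries are paired with the document. At the scale described in the paper, the inner loop would be replaced by an approximate nearest neighbor index.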

Theorems & Definitions (6)

  • lemma 1
  • corollary 1
  • Theorem 1
  • lemma 2
  • corollary 2
  • Theorem -1