Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies
Puxuan Yu, Antonio Mallia, Matthias Petri
TL;DR
The paper addresses the efficiency-effectiveness gap in learned sparse retrieval by introducing corpus-specific vocabularies (CSV) learned on the target corpus. It investigates how vocabulary size, target-corpus pre-training, corpus-specific document expansion (via TILDE-AUG-CSV), and distillation affect the performance of SPLADE and uniCOIL on MS MARCO and TREC benchmarks. Results show up to a 12% relative improvement in retrieval quality and up to a 50% reduction in query latency; a 100k vocabulary often suffices for MS MARCO v1, while larger vocabularies offer speedups when expansion is controlled. The approach is simple, generalizes across models, and opens avenues for refined vocabulary selection and index-aware expansion that enable new efficiency-effectiveness trade-offs in learned sparse retrieval systems.
Abstract
We explore leveraging corpus-specific vocabularies that improve both the efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus with vocabularies of different sizes, and incorporating these vocabularies into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing the vocabulary size decreases the average postings list length, which in turn reduces latency. Ablation studies reveal interactions between custom vocabularies, document expansion techniques, and the sparsification objectives of sparse models. Both the effectiveness and the efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE, offering a simple yet effective way to provide new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
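The link between vocabulary size and postings list length can be illustrated with a minimal sketch. The toy corpus, both tokenizations, and the helper names `build_index` and `avg_postings_length` are hypothetical and chosen only for illustration; they are not the paper's data or code. The same four documents are indexed once with a small subword vocabulary and once with a larger vocabulary that keeps whole words.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def avg_postings_length(index):
    """Total number of postings divided by the number of distinct terms."""
    return sum(len(postings) for postings in index.values()) / len(index)


# Hypothetical toy corpus, tokenized with a small generic subword vocabulary:
# the shared piece "retriev" ends up in a long postings list.
small_vocab_docs = [
    ["retriev", "##al", "latency"],
    ["retriev", "##er", "index"],
    ["retriev", "##ing", "index"],
    ["latency", "index"],
]

# The same documents, tokenized with a larger vocabulary that has whole words:
# each word gets its own, shorter postings list.
large_vocab_docs = [
    ["retrieval", "latency"],
    ["retriever", "index"],
    ["retrieving", "index"],
    ["latency", "index"],
]

small_index = build_index(small_vocab_docs)
large_index = build_index(large_vocab_docs)

print("small vocabulary:", avg_postings_length(small_index))
print("large vocabulary:", avg_postings_length(large_index))
print("postings for 'retriev':", small_index["retriev"])
print("postings for 'retrieval':", large_index["retrieval"])
```

In this toy setting, a query for "retrieval" traverses a three-entry list for `retriev` under the small vocabulary but a single-entry list under the large one. Real latency also depends on query length, term weights, and the query processing algorithm, none of which this sketch models.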
