Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies

Puxuan Yu, Antonio Mallia, Matthias Petri

TL;DR

The paper targets both the efficiency and the effectiveness of learned sparse retrieval by introducing corpus-specific vocabularies (CSV) learned on the target corpus. It investigates how vocabulary size, target-corpus pre-training, corpus-specific document expansion (via TILDE-AUG-CSV), and distillation affect performance across SPLADE and uniCOIL on MS MARCO and TREC benchmarks. Results show up to a 12% relative improvement in retrieval quality and up to a 50% reduction in query latency, with 100k vocabularies often sufficing for MS MARCO v1 and larger vocabularies offering speedups when expansion is controlled. The approach is simple, generalizes across models, and opens avenues for refined vocabulary selection and index-aware expansion to achieve new efficiency-effectiveness trade-offs in learned sparse retrieval systems.

Abstract

We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
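
The link the abstract draws between vocabulary and postings list length can be illustrated with a toy inverted index. The sketch below is not the paper's pipeline (it involves no pre-training, document expansion, or learned term weights). The four-document corpus, both vocabularies, and the simplified greedy longest-match tokenizer (no WordPiece "##" continuation markers) are assumptions made purely for illustration.

```python
from collections import defaultdict


def tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown spans fall back to single characters."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate piece until it is in the vocabulary
            # or only one character is left.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens


def build_index(docs, vocab):
    """Map each term to the set of document ids that contain it (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text, vocab):
            index[term].add(doc_id)
    return index


def avg_postings_length(index):
    return sum(len(postings) for postings in index.values()) / len(index)


# Hypothetical toy corpus.
docs = [
    "tokenizer latency",
    "tokenization quality",
    "token pruning",
    "index latency",
]

# Hypothetical "generic" vocabulary: corpus-specific words are split into shared pieces.
generic_vocab = {"token", "izer", "ization", "latency", "quality", "pruning", "index"}
# Hypothetical larger, corpus-specific vocabulary: frequent corpus words are kept whole.
csv_vocab = generic_vocab | {"tokenizer", "tokenization"}

generic_index = build_index(docs, generic_vocab)
csv_index = build_index(docs, csv_vocab)

print("generic avg postings length:", avg_postings_length(generic_index))
print("CSV avg postings length:    ", avg_postings_length(csv_index))
print("longest list (generic, CSV):",
      max(len(p) for p in generic_index.values()),
      max(len(p) for p in csv_index.values()))
```

In this toy setting the shared piece "token" appears in three documents under the generic vocabulary, while the larger vocabulary assigns "tokenizer" and "tokenization" their own shorter lists. Fewer and shorter lists to traverse per query term is the mechanism the abstract credits for the latency reduction. The actual effect in the paper also depends on document expansion and the sparsification objective.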

Paper Structure (24 sections, 3 figures, 6 tables)

This paper contains 24 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Latency and effectiveness improvements achieved by leveraging Corpus Specific Vocabularies (CSV) (with different vocabulary sizes) compared to baseline learned sparse retrieval models.
  • Figure 2: A high-level overview of the workflow described in this work. Because the vocabulary of the language model is learned on the target retrieval corpus, and the sparse retrieval models (e.g., SPLADE and uniCOIL) and the document expansion models (e.g., TILDE) all use the language model as their backbone, every component of the learned sparse retrieval system, including the resulting inverted index, is influenced by the corpus-specific vocabulary (CSV).
  • Figure 3: Cumulative distribution of lists that have list max scores higher than a given value. BERT displays less skew in list max scores, which negatively affects performance.