Simple Domain Adaptation for Sparse Retrievers

Mathias Vast; Yuxuan Zong; Basile Van Cooten; Benjamin Piwowarski; Laure Soulier

TL;DR

This work tackles domain adaptation in information retrieval when labeled in-domain data are scarce, focusing on sparse first-stage retrievers. It transposes ideas from language adaptation by introducing a cross-domain pre-training framework that separates domain-specific parameters $P_{domain}$ from task-specific parameters $P_{task}$: MLM pre-training is performed on both the source and target domains, and only the task portion is fine-tuned on the source. At inference on the target domain, the model combines $P_{domain}^{target}$ with $P_{task}^{source}$, enabling cross-domain transfer without annotating new data. Empirical results on SPLADE show average gains of $0.7$–$1.4$ nDCG@10 points over zero-shot, with the best performance when the number $k$ of first layers trained as domain layers is around 2. Additional pre-training on the source domain improves robustness, whereas a second-stage ranker (MonoBERT) does not benefit from this approach. The method is cost-efficient and reusable across domains, which suggests it is practical for deploying IR systems to new domains or languages where labeled data is limited.
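The parameter recombination described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes BERT-style parameter names (`embeddings.*`, `encoder.layer.<i>.*`), assumes that the embeddings are counted as part of $P_{domain}$ along with the first $k$ layers, and uses toy string values in place of real weight tensors. The helper names `is_domain_param`, `split_params`, and `compose` are hypothetical.

```python
import re

# Assumption: BERT-style naming, where transformer layer i is "encoder.layer.<i>.".
LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_domain_param(name, k):
    """Return True if `name` belongs to P_domain.

    Assumption (not confirmed by the source): P_domain holds the embeddings
    and the first k transformer layers; everything else is P_task.
    """
    if name.startswith("embeddings."):
        return True
    match = LAYER_RE.search(name)
    return match is not None and int(match.group(1)) < k


def split_params(state, k):
    """Split a name -> weight mapping into (P_domain, P_task)."""
    domain = {n: w for n, w in state.items() if is_domain_param(n, k)}
    task = {n: w for n, w in state.items() if not is_domain_param(n, k)}
    return domain, task


def compose(domain_params, task_params):
    """Build the inference-time model from P_domain^target and P_task^source."""
    overlap = domain_params.keys() & task_params.keys()
    if overlap:
        raise ValueError(f"parameters assigned to both groups: {sorted(overlap)}")
    return {**domain_params, **task_params}


# Toy checkpoints: strings stand in for weight tensors.
# Source model: MLM pre-trained on the source domain, then task fine-tuned.
source_model = {
    "embeddings.word": "src-emb",
    "encoder.layer.0.w": "src-l0",
    "encoder.layer.1.w": "src-l1",
    "encoder.layer.2.w": "src-finetuned-l2",
    "head.w": "src-finetuned-head",
}
# Target model: MLM pre-trained on the target domain only (no labels needed).
target_model = {
    "embeddings.word": "tgt-emb",
    "encoder.layer.0.w": "tgt-l0",
    "encoder.layer.1.w": "tgt-l1",
    "encoder.layer.2.w": "tgt-l2",
    "head.w": "tgt-head",
}

K = 2  # number of leading layers treated as domain layers
_, task_source = split_params(source_model, K)
domain_target, _ = split_params(target_model, K)
adapted = compose(domain_target, task_source)
```

In this toy setup, `adapted` takes its embeddings and layers 0 and 1 from the target checkpoint and the remaining layers from the fine-tuned source checkpoint. With real checkpoints, the same split could be applied to the models' parameter dictionaries before loading them.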

Abstract

In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is conducted through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when training data does not exist. Using the model without training (zero-shot) is another option, but it comes at a cost in effectiveness, especially in the case of first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that sparse retrievers, despite their relatively good generalization ability, can benefit from our simple domain adaptation method.

Paper Structure (9 sections, 1 figure, 3 tables)

Introduction
Related works
Cross-Domain Adaptation of a Neural Network model
Experiments
Datasets, baselines, and ablations
Results
Discussion and limitations
Conclusion
Acknowledgements

Figures (1)

Figure 1: Illustration of the cross-domain adaptation process
