Simple Domain Adaptation for Sparse Retrievers
Mathias Vast, Yuxuan Zong, Basile Van Cooten, Benjamin Piwowarski, Laure Soulier
TL;DR
This work tackles domain adaptation in information retrieval when labeled in-domain data are scarce, focusing on sparse first-stage retrievers. It transposes ideas from language adaptation into a cross-domain pre-training framework that separates domain-specific parameters $P_{domain}$ from task-specific parameters $P_{task}$: MLM pre-training is performed on the source and target domains, and only the task-specific parameters are fine-tuned, on the source domain. At inference on the target domain, the model combines $P_{domain}^{target}$ with $P_{task}^{source}$, enabling cross-domain transfer without annotating new data. Empirical results on SPLADE show average gains in $nDCG@10$ of $0.7$–$1.4$ points over zero-shot, with the best performance obtained when the first $k$ layers are trained as domain layers with $k$ around 2. The results also indicate that additional pre-training on the source domain improves robustness, while second-stage ranking (MonoBERT) does not benefit from this approach. The method is cost-efficient and reusable across domains, which suggests it is practical for deploying IR systems to new domains or languages where labeled data are limited.
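The recombination of $P_{domain}^{target}$ with $P_{task}^{source}$ at inference can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that parameters are stored in a flat name-to-values dictionary, that names follow a hypothetical `layer.<i>.<name>` scheme, and that the first $k$ layers hold the domain-specific parameters while everything else is task-specific. The function names and toy values are illustrative only.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a model checkpoint: parameter name -> values.
Params = Dict[str, List[float]]


def is_domain_param(name: str, k: int) -> bool:
    """Assumption: names look like 'layer.<i>.<...>' and the first k
    layers (i < k) are the domain-specific ones."""
    parts = name.split(".")
    return (
        len(parts) > 1
        and parts[0] == "layer"
        and parts[1].isdigit()
        and int(parts[1]) < k
    )


def split_params(params: Params, k: int) -> Tuple[Params, Params]:
    """Split a checkpoint into (P_domain, P_task)."""
    p_domain = {n: v for n, v in params.items() if is_domain_param(n, k)}
    p_task = {n: v for n, v in params.items() if not is_domain_param(n, k)}
    return p_domain, p_task


def compose(p_domain: Params, p_task: Params) -> Params:
    """Merge domain and task parameters into one checkpoint."""
    overlap = set(p_domain) & set(p_task)
    if overlap:
        raise ValueError(f"parameters assigned to both parts: {sorted(overlap)}")
    return {**p_domain, **p_task}


# Hypothetical checkpoints with made-up values.
# Source model: task parameters fine-tuned on labeled source data.
source_finetuned: Params = {
    "layer.0.w": [0.1], "layer.1.w": [0.2], "layer.2.w": [0.3], "head.w": [0.4],
}
# Target model: domain parameters pre-trained with MLM on unlabeled target data.
target_pretrained: Params = {
    "layer.0.w": [1.1], "layer.1.w": [1.2], "layer.2.w": [9.9], "head.w": [9.9],
}

k = 2  # number of domain layers
_, p_task_source = split_params(source_finetuned, k)
p_domain_target, _ = split_params(target_pretrained, k)

# Model used at inference on the target domain.
target_model = compose(p_domain_target, p_task_source)
```

In this sketch the target-domain model takes its first two layers from the target pre-training run and the remaining parameters from the source fine-tuning run, so no target-domain labels are involved at any step.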
Abstract
In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is usually done through fine-tuning. Despite the successes and versatility of this method, its need for human-curated, labeled data makes it impractical for transferring to new tasks, domains, and/or languages when training data does not exist. Using the model without training (zero-shot) is another option, but it comes at a cost in effectiveness, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that sparse retrievers, despite their relatively good generalization ability, can still benefit from our simple domain adaptation method.
