Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Yubai Wei, Jiale Han, Yi Yang
TL;DR
BMEmbed introduces a practical, unsupervised framework to adapt general-purpose text embedding models to private-domain data by using BM25-derived ranking signals as supervision. The pipeline includes domain event-driven query generation, BM25-based relevant sampling with ranking partitions, and listwise fine-tuning, enabling the embedding model to align with domain-specific terminology while preserving semantic generalization. Empirical results across multiple models and domains show consistent retrieval improvements over baselines such as BM25, CL, and RRF, with the method also balancing alignment and uniformity for robust performance. The approach is extensible to smaller models and alternative tasks, offering a scalable path to domain-specific improvements in retrieval-augmented generation.
Abstract
Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. The source code is available to the research community at https://github.com/BaileyWei/BMEmbed.
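
To make the idea of deriving supervision from a keyword-based ranking more concrete, the following is a minimal sketch, not the released BMEmbed implementation. It assumes pre-tokenized text, the standard Okapi BM25 formula with common default parameters (k1 = 1.5, b = 0.75), and equal-sized rank bands; the function names `bm25_scores` and `partition_ranking` and the banding scheme are illustrative assumptions. Documents from different bands could then serve as graded relevance levels for a listwise training objective.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def partition_ranking(scores, n_partitions=3):
    """Sort document indices by score (descending) and split them into
    contiguous rank bands. The equal-sized banding is an assumption made
    for illustration only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = math.ceil(len(ranked) / n_partitions)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

In this sketch, the first band holds the documents BM25 ranks highest for a query and later bands hold progressively weaker matches, so the band index can act as a coarse relevance label without any human annotation.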
