DREditor: An Time-efficient Approach for Building a Domain-specific Dense Retrieval Model

Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv

TL;DR

DREditor targets the time- and resource-intensive process of adapting a dense retrieval (DR) model to a specific domain by introducing a post-hoc embedding calibration method. It learns a linear edit operator $W_{QA}=I+\Delta W$ through a least-squares objective that aligns domain-specific query embeddings with their corresponding corpus embeddings, so editing is fast and non-iterative. The approach yields 100–300× speedups over gradient-based fine-tuning while achieving comparable retrieval performance across finance, science, and biomedical domains, including zero-shot scenarios (ZeroDR); it also provides a closed-form solution and a sampling-based efficiency improvement. Practically, this work demonstrates the viability of embedding retrofitting as a lighter-weight alternative for building domain-specific DR models, with implications for scalable enterprise search deployments.
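
To make the idea of a closed-form linear edit operator concrete, here is a minimal sketch in NumPy. It is not the paper's exact objective: it assumes a plain ridge-regularized least-squares problem, $\min_{\Delta W}\|Q(I+\Delta W)-A\|_F^2+\lambda\|\Delta W\|_F^2$, where the rows of `Q` and `A` are paired query and answer embeddings. The function name `fit_edit_operator`, the regularizer `lam`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_edit_operator(Q, A, lam=1e-2):
    """Return W = I + dW minimizing ||Q @ W - A||_F^2 + lam * ||dW||_F^2.

    Assumption: a generic ridge objective, not the paper's formulation.
    Q, A: (n, d) arrays of paired query / answer embeddings.
    """
    d = Q.shape[1]
    # Setting the gradient to zero gives the normal equations
    # (Q^T Q + lam * I) dW = Q^T (A - Q), solved here without iteration.
    gram = Q.T @ Q + lam * np.eye(d)
    dW = np.linalg.solve(gram, Q.T @ (A - Q))
    return np.eye(d) + dW


# Synthetic stand-in for embeddings from a frozen retrieval model.
rng = np.random.default_rng(0)
n, d = 200, 16
A = rng.normal(size=(n, d))
# Queries are the answers under a random linear distortion plus small noise.
M = np.eye(d) + 0.3 * rng.normal(size=(d, d))
Q = A @ M + 0.01 * rng.normal(size=(n, d))

W = fit_edit_operator(Q, A)
before = np.linalg.norm(Q - A)      # misalignment of raw query embeddings
after = np.linalg.norm(Q @ W - A)   # misalignment after calibration
print(f"before={before:.3f}, after={after:.3f}")
```

Because the operator comes from a single linear solve over a $d \times d$ system, its cost depends on the embedding dimension and the number of pairs rather than on training epochs, which is the intuition behind the reported speedups over gradient-based fine-tuning.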

Abstract

Deploying dense retrieval models efficiently is becoming increasingly important across various industries. This is especially true for enterprise search services, where customizing search engines to meet the time demands of different enterprises in different domains is crucial. Motivated by this, we develop a time-efficient approach called DREditor to edit the matching rule of an off-the-shelf dense retrieval model to suit a specific domain. This is achieved by directly calibrating the output embeddings of the model using an efficient and effective linear mapping. This mapping is powered by an edit operator that is obtained by solving a specially constructed least squares problem. Compared to implicit rule modification via long-time finetuning, our experimental results show that DREditor provides significant advantages on different domain-specific datasets, dataset sources, retrieval models, and computing devices. It consistently enhances time efficiency by 100-300 times while maintaining comparable or even superior retrieval performance. In a broader context, we take the first step to introduce a novel embedding calibration approach for the retrieval task, filling the technical blank in the current field of embedding calibration. This approach also paves the way for building domain-specific dense retrieval models efficiently and inexpensively.

Paper Structure

This paper contains 24 sections, 3 theorems, 5 equations, 6 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

A weighted least-squares form of the optimization problem (Eq. 3) can be rewritten in an equivalent form that involves only $QA$.

Figures (6)

  • Figure 1: Fine-tuning an individual dense retrieval model for each enterprise is time-consuming, whereas the proposed DREditor is time-efficient because it requires no iterative gradient-based optimization.
  • Figure 2: Pipeline of DREditor. 1) DREditor uses the data that violate the matching rule of the DR model to construct an edit operator without model training. 2) It then uses the operator to calibrate the query and corpus embeddings.
  • Figure 3: Illustration of the embedding space on SciFact with the SBERT backbone. ${x_q}'$ and ${x_a}'$ denote the question and answer embeddings after calibration. DREditor strengthens the semantic association between questions and answers while leaving the answer embeddings unchanged.
  • Figure 4: The embedding space on the FiQA dataset with the SBERT backbone.
  • Figure 5: The embedding space on the NFCorpus dataset with the SBERT backbone.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • Remark 1
  • Remark 2
  • Theorem \ref{theory}