Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek

TL;DR

This work introduces Synthesize-Train-Merge (STM), a modular framework to convert decoder-only LLMs into domain-specific dense retrievers by combining synthetic hard negatives, retrieval prompt optimization, and model merging. Through systematic experiments on biomedical and general-domain tasks from MTEB, STM achieves up to 23.5% task-specific gains and produces merged retrievers that outperform individual experts and strong baselines without large-scale pretraining. Key insights include the dominance of prompt optimization over hard-negative mining in many settings, and the efficacy of Linear merging (MergeKit) to fuse expert representations into a single, robust retriever. The approach demonstrates a scalable, data-efficient path for adapting general LLMs to specialized biomedical retrieval while preserving broad-domain capabilities, with practical implications for RAG systems in specialized fields.

Abstract

Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects of adapting general-purpose LLMs into effective domain-specific retrievers remain underexplored, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show that STM boosts task-specific experts by up to 23.5% (7.5% on average) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers that preserve general-domain capabilities while excelling on specialized tasks.
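To make the merging step concrete, here is a minimal sketch of linear merging, understood as a weighted average of expert parameters. It is not MergeKit's API or the authors' code: the function name `linear_merge`, the flat-list stand-in for weight tensors, the parameter name `layer.weight`, and the choice to normalize coefficients so they sum to one are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def linear_merge(
    experts: Sequence[StateDict],
    weights: Sequence[float],
    normalize: bool = True,
) -> StateDict:
    """Return the per-parameter weighted average of several expert models."""
    if not experts:
        raise ValueError("need at least one expert")
    if len(experts) != len(weights):
        raise ValueError("one weight per expert is required")

    coeffs = list(weights)
    if normalize:
        total = sum(coeffs)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        # Assumption: rescale coefficients so they sum to one.
        coeffs = [w / total for w in coeffs]

    # All experts must share the same parameter names (same architecture).
    names = experts[0].keys()
    for expert in experts[1:]:
        if expert.keys() != names:
            raise ValueError("experts must share the same parameter names")

    merged: StateDict = {}
    for name in names:
        size = len(experts[0][name])
        merged[name] = [
            sum(c * expert[name][i] for c, expert in zip(coeffs, experts))
            for i in range(size)
        ]
    return merged


# Toy usage: two hypothetical experts with a single two-element parameter.
medical_expert = {"layer.weight": [1.0, 2.0]}
general_expert = {"layer.weight": [3.0, 6.0]}
merged_model = linear_merge([medical_expert, general_expert], weights=[3.0, 1.0])
```

With weights 3 and 1, the normalized coefficients are 0.75 and 0.25, so the merged parameter sits three quarters of the way toward the first expert. In practice the same averaging would run over full tensors rather than Python lists.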

Paper Structure (36 sections, 3 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Diagram of our recipe to obtain the STM retrievers: 1) synthetic data, comprising 1.1) hard negative generation and 1.2) retrieval prompt optimization; 2) LoRA fine-tuning; and 3) model merging. We segment the BMRetriever dataset into four splits: Real Medical, Synthetic Medical, NLU, and Search.
  • Figure 2: Performance comparison of STM Merged Models versus models fine-tuned on the combined datasets of all merged experts, across three base models, using the average NDCG@10 metric across all datasets.
  • Figure 3: Performance averages of three base models pre-trained (PT) and/or fine-tuned (FT) on the BMRetriever datasets with 10M and 1.4M samples, respectively.
  • Figure 4: Performance averages across 3 runs of three base models fine-tuned on three different sample sizes of the BMRetriever dataset. Standard deviations are not displayed since they are below 0.01.
  • Figure 5: Merging weight coefficients for each expert for Linear and TIES techniques for each model.
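The figures above report results as average NDCG@10. As a reminder of what that metric measures, the following is a small sketch of NDCG@k for a single query. It assumes the linear-gain variant (gain equal to the relevance label) with a log2 rank discount. The function names and the toy relevance labels are hypothetical, and this is not the evaluation code used in the paper.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain assumed)."""
    # enumerate is 0-based, so the 1-based rank r gets the discount log2(r + 1).
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance labels of retrieved documents, in ranked order.
    all_relevances: relevance labels of every judged document for the query,
        used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        # No relevant documents exist for this query; define the score as 0.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy usage: the single relevant document is retrieved at rank 2 instead of rank 1.
score = ndcg_at_k(ranked_relevances=[0, 1, 0], all_relevances=[1, 0, 0])
```

A benchmark-level number such as the averages in the figures would then be the mean of these per-query scores within each dataset, followed by an average across datasets.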