Study on LLMs for Promptagator-Style Dense Retriever Training
Daniel Gwon, Nour Jedidi, Jimmy Lin
TL;DR
This work investigates whether open-source LLMs at accessible scales (≤14B parameters) can substitute for proprietary models in Promptagator-style synthetic data generation for dense retrievers. By introducing Promptodile and evaluating 10 query-generation (QGen) models across BEIR datasets, the study shows that small LLMs (around 3B parameters) perform competitively with the large Promptagator baseline, and that larger models yield diminishing returns. The results highlight cost, privacy, and compute advantages, demonstrate complementary gains from MS-MARCO transfer and alternative retrievers, and provide actionable guidance for deploying domain-specific dense retrieval under resource constraints. Overall, the findings offer a practical pathway for developing robust, privacy-conscious dense retrieval systems with open-source LLMs.
Abstract
Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on large-scale proprietary LLMs, which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales ($\leq$14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will point practitioners to reliable alternatives for synthetic data generation and offer insights for maximizing fine-tuning results in domain-specific applications.
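
To make the few-shot query-generation idea concrete, the sketch below assembles a prompt from a handful of (document, query) pairs and a new document, leaving the final query slot open for an LLM to complete. The function name `build_fewshot_prompt`, the `Document:`/`Query:` template, and the example texts are illustrative assumptions, not the prompt format used in this work or in the original Promptagator; the resulting string could be passed to any text-generation model.

```python
from typing import List, Tuple


def build_fewshot_prompt(examples: List[Tuple[str, str]], new_document: str) -> str:
    """Build a few-shot prompt for query generation.

    Hypothetical template: each example is rendered as a
    "Document: ... / Query: ..." block, and the new document is
    appended with an empty query slot for the LLM to fill in.
    """
    blocks = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    blocks.append(f"Document: {new_document}\nQuery:")
    return "\n\n".join(blocks)


# Made-up few-shot examples standing in for task-specific pairs.
examples = [
    ("Aspirin reduces the risk of heart attack in some adults.",
     "does aspirin lower heart attack risk"),
    ("The BEIR benchmark covers a range of retrieval tasks.",
     "what is the BEIR benchmark"),
]

prompt = build_fewshot_prompt(
    examples, "Dense retrievers encode queries and passages as vectors."
)
print(prompt)
```

Each generated query, paired with the document it was generated from, would then serve as a synthetic positive training pair for fine-tuning the dense retriever.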
