SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval
Xiaopeng Li, Xiangyang Li, Hao Zhang, Zhaocheng Du, Pengyue Jia, Yichao Wang, Xiangyu Zhao, Huifeng Guo, Ruiming Tang
TL;DR
Dense retrieval (DR) performance depends heavily on the quality and diversity of negative samples. This paper introduces SyNeg, an LLM-driven framework that generates high-quality hard negatives with a multi-attribute self-reflection prompt, combines them with retrieved negatives through a hybrid instance-level mixing strategy, and then applies contrastive fine-tuning. The approach yields consistent improvements across BEIR benchmarks and multiple DR backbones, with notable gains on knowledge-intensive tasks, and is supported by theoretical analysis linking negative sampling quality to training dynamics and MRR. The work highlights the practical impact of synthetic negatives for scalable, robust dense retrieval and offers detailed guidance on prompting, mixing, and hyperparameter tuning. Limitations include dependence on API-based LLMs and allocation costs, which suggests future exploration of open-source LLMs and broader modalities.
Abstract
The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods depend primarily on naive negative sampling techniques or on mining hard negatives through an external retriever and meticulously crafted strategies. However, naive negative sampling often fails to capture the boundary between positive and negative samples accurately, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advances in large language models (LLMs) offer a solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a multi-attribute self-reflection prompting strategy to direct LLMs in hard negative sample generation. We then implement a hybrid sampling strategy that integrates these synthetic negatives with traditionally retrieved negatives, thereby stabilizing the training process and improving retrieval performance. Extensive experiments on five benchmark datasets demonstrate the efficacy of our approach, and the code is publicly available.
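To make the hybrid sampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each query has two pools of negatives (LLM-synthesized and retriever-mined), that a fixed share of the per-query negatives is drawn from the synthetic pool, and that training uses a standard InfoNCE-style contrastive loss over dot-product scores. The names `mix_negatives`, `info_nce`, `synthetic_ratio`, and `temperature` are hypothetical and chosen for illustration only.

```python
import math
import random


def mix_negatives(synthetic, retrieved, k, synthetic_ratio=0.5, rng=None):
    """Pick up to k negatives for one query, mixing the two pools.

    synthetic_ratio is an assumed hyperparameter: the share of the k
    negatives taken from the LLM-synthesized pool.
    """
    rng = rng or random.Random()
    n_syn = min(len(synthetic), round(k * synthetic_ratio))
    n_ret = min(len(retrieved), k - n_syn)
    return rng.sample(synthetic, n_syn) + rng.sample(retrieved, n_ret)


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for a single query: -log softmax of the positive score."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


if __name__ == "__main__":
    mixed = mix_negatives(["s1", "s2", "s3"], ["r1", "r2", "r3"], k=4,
                          rng=random.Random(0))
    print(mixed)
    # A negative close to the query (a "hard" negative) gives a larger loss.
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.9, 0.1]]))
```

In this toy setup, a negative whose embedding lies close to the query contributes a much larger loss than an unrelated one, which is the usual intuition for why hard negatives provide a stronger training signal than naive ones.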
