TriSampler: A Better Negative Sampling Principle for Dense Retrieval
Zhen Yang, Zhou Shao, Yuxiao Dong, Jie Tang
TL;DR
The paper addresses the lack of a general guiding principle for negative sampling in dense retrieval. It introduces the quasi-triangular principle, which constrains negatives within a triangular-like region relative to a query and its positive; the angular boundary is defined by $\theta = \left| \arccos\left( \frac{s(q,d^+)}{\|q\| \cdot \|d^+\|} \right) - \arccos\left( \frac{s(q,d^-)}{\|q\| \cdot \|d^-\|} \right) \right|$, with a boundary at $60^\circ$. Building on this, TriSampler constructs informative negative candidates and applies two distributions that enforce the region: $p_d^{-(q)} \propto \exp\left( -(s_- - s_+)^2 / 4 \right)$ over the top-K candidates and $p_d^- \propto \mathrm{ReLU}\left( s(d^+,d^-) - s(q,d^-) \right)$. Empirical results across four benchmarks (NQ, TQA, MS MARCO Passage, MS MARCO Document) show TriSampler improves retrieval performance over prior negative sampling strategies, with faster convergence and scalable applicability to diverse dense retrievers. The findings suggest principled, region-constrained negative sampling yields more informative signals, enhancing both effectiveness and efficiency in dense retrieval systems.
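The angular criterion and the two sampling distributions summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $s(\cdot,\cdot)$ is the inner product of non-zero embedding vectors, that a negative counts as inside the region when $\theta \le 60^\circ$, and that $s_+ = s(q,d^+)$ and $s_- = s(q,d^-)$. All function names are hypothetical, and the top-K candidate list is assumed to have been retrieved already.

```python
import math

def dot(u, v):
    # Inner product, used here as the similarity s(u, v) (an assumption).
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # Angle between two non-zero vectors, in degrees.
    cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))

def angular_gap(q, d_pos, d_neg):
    # theta = |angle(q, d+) - angle(q, d-)|
    return abs(angle_deg(q, d_pos) - angle_deg(q, d_neg))

def in_region(q, d_pos, d_neg, boundary_deg=60.0):
    # Assumption: "inside the region" means theta <= boundary.
    return angular_gap(q, d_pos, d_neg) <= boundary_deg

def gaussian_weights(q, d_pos, candidates):
    # p proportional to exp(-(s_- - s_+)^2 / 4), normalized over the candidates.
    s_pos = dot(q, d_pos)
    w = [math.exp(-((dot(q, d) - s_pos) ** 2) / 4.0) for d in candidates]
    total = sum(w)
    return [x / total for x in w]

def relu_weights(q, d_pos, candidates):
    # p proportional to ReLU(s(d+, d-) - s(q, d-)).
    w = [max(0.0, dot(d_pos, d) - dot(q, d)) for d in candidates]
    total = sum(w)
    # Assumption: if every weight is zero, return the zeros unnormalized.
    return [x / total for x in w] if total > 0 else w
```

In this sketch, the Gaussian-shaped weights favor candidates whose query score is close to that of the positive, while the ReLU weights give zero probability to any candidate that is at least as similar to the query as it is to the positive.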
Abstract
Negative sampling stands as a pivotal technique in dense retrieval, essential for training effective retrieval models and significantly impacting retrieval performance. While existing negative sampling methods have made commendable progress by leveraging hard negatives, a comprehensive guiding principle for constructing negative candidates and designing negative sampling distributions is still lacking. To bridge this gap, we embark on a theoretical analysis of negative sampling in dense retrieval. This exploration culminates in the unveiling of the quasi-triangular principle, a novel framework that elucidates the triangular-like interplay between query, positive document, and negative document. Fueled by this guiding principle, we introduce TriSampler, a straightforward yet highly effective negative sampling method. The key point of TriSampler lies in its ability to selectively sample more informative negatives within a prescribed constrained region. Experimental evaluations show that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models.
