Syntriever: How to Train Your Retriever with Synthetic Data from LLMs

Minsang Kim, Seungjun Baek

TL;DR

Syntriever introduces a two-stage framework to distill black-box LLM knowledge into lightweight retrievers using synthetic data: Stage 1 synthesizes augmented queries, relevant passages, and hard negatives with self-verification, training the retriever with a cluster-friendly, soft-nearest-neighbor-like loss. Stage 2 aligns the retriever with LLM preferences via top-K retrieval, pairwise comparisons, and a partial Plackett-Luce ranking loss that regularizes to prevent deviation from the distillation stage. Across BeIR benchmarks, Syntriever achieves state-of-the-art or near-state-of-the-art results, with strong in-domain and zero-shot transfer, and demonstrates robustness across encoders and LLM choices. The approach highlights the practical viability of distilling LLM knowledge into compact retrievers using synthetic data and preference feedback, offering a scalable path for knowledge-intensive retrieval systems.

Abstract

LLMs have boosted progress in many AI applications. Recently, there have been attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use the output probabilities of LLMs, which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. First, in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thought for the given queries. The LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Second, in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling method called partial Plackett-Luce ranking to learn LLM preferences, with regularization that prevents the model from deviating excessively from the one trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performance on benchmark datasets from various domains in nDCG@$K$. The code is available at https://github.com/kmswin1/Syntriever.
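To make the distillation-stage objective more concrete, here is a minimal sketch of a soft-nearest-neighbor-style loss that pulls a query toward all of its relevant passages at once. This summary does not state the paper's exact formula, so the function name `soft_nn_loss`, the use of cosine similarity, and the `temperature` value are assumptions made for illustration only. Embeddings are assumed to be non-zero vectors.

```python
import math


def soft_nn_loss(query, positives, negatives, temperature=0.05):
    """Hypothetical soft-nearest-neighbor-style loss for one query.

    Computes -log( sum_pos exp(s/T) / sum_all exp(s/T) ), where s is the
    cosine similarity between the query and a passage embedding.
    This is an illustrative form, not the paper's confirmed objective.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    def logsumexp(xs):
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    pos_logits = [cosine(query, p) / temperature for p in positives]
    neg_logits = [cosine(query, n) / temperature for n in negatives]

    # The loss shrinks as the positives take up more of the total similarity mass.
    return logsumexp(pos_logits + neg_logits) - logsumexp(pos_logits)


if __name__ == "__main__":
    q = [1.0, 0.0]
    relevant = [[1.0, 0.0], [0.9, 0.1]]   # toy stand-ins for synthetic positives
    irrelevant = [[0.0, 1.0]]             # toy stand-in for a hard negative
    print(soft_nn_loss(q, relevant, irrelevant))
```

Because the numerator sums over every positive, the loss rewards embeddings in which all relevant passages cluster near the query, rather than only the single closest one.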

Paper Structure

This paper contains 30 sections, 19 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Overview of Syntriever. Stage-1 (Distillation Stage). Given a query, Syntriever uses LLMs to synthesize (i) related sub-queries (prompt $\mathcal{P}_\text{cot}$), (ii) relevant passages ($\mathcal{P}_+$), which are self-verified for hallucination ($\mathcal{P}_\text{Relabel}$), and (iii) plausibly irrelevant passages ($\mathcal{P}_-$). The retriever is trained with the synthetic positive and negative passages. Stage-2 (Alignment Stage). The retriever is aligned with the LLM preferences. The LLM compares passage pairs from the top-$K$ retrieved passages. If the LLM prefers $y_w$ over $y_l$, we write $y_w\succ y_l$. We propose partial Plackett-Luce ranking to combine preference modeling and contrastive learning for the retriever to learn $y_w\succ y_l \succ$ {in-batch negatives}.
  • Figure 2: Example of LLM synthesis. The correct answer to the query is shown in red font.
  • Figure 3: Prompt template design for generating synthetic positive passages.
  • Figure 4: Prompt template design for generating plausible but irrelevant passages.
  • Figure 5: Prompt template design of relabeling for synthetic positive passages.
  • ...and 1 more figure
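
The ordering $y_w\succ y_l \succ$ {in-batch negatives} from the Figure 1 caption can be illustrated with the standard Plackett-Luce likelihood truncated after the first two picks. The sketch below is an assumption-laden illustration: the function name `partial_plackett_luce_loss` and the use of raw similarity scores are hypothetical, and the regularization mentioned in the abstract is not modeled because its form is not given in this summary.

```python
import math


def partial_plackett_luce_loss(s_w, s_l, s_negs):
    """Negative log-likelihood of the partial ranking y_w > y_l > {negatives}.

    s_w, s_l: retriever scores for the preferred and less-preferred passage.
    s_negs:   scores of in-batch negatives; their mutual order is left free.
    Illustrative Plackett-Luce form, not the paper's confirmed loss.
    """
    def logsumexp(xs):
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    negs = list(s_negs)
    # -log P(y_w is chosen first among all candidates)
    first_pick = logsumexp([s_w, s_l] + negs) - s_w
    # -log P(y_l is chosen next among the remaining candidates)
    second_pick = logsumexp([s_l] + negs) - s_l
    return first_pick + second_pick


if __name__ == "__main__":
    # Scores consistent with the LLM preference give the lower loss.
    print(partial_plackett_luce_loss(5.0, 3.0, [0.0, 0.0]))
    print(partial_plackett_luce_loss(3.0, 5.0, [0.0, 0.0]))
```

With an empty negative set the expression reduces to the pairwise logistic (Bradley-Terry) loss on $y_w$ versus $y_l$, while the negatives in each denominator supply the contrastive signal.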