LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding

Mingrui Wu, Sheng Cao

TL;DR

The paper introduces LLM-augmented retrieval, a framework that enriches doc-level embeddings with synthetic relevant queries and synthetic titles generated by language models, combined with passages (chunks) split from the original document. It adapts this doc-level embedding to both bi-encoder and token-level late-interaction models, yielding state-of-the-art results on LoTTE and BEIR without additional fine-tuning. Key contributions include a concrete doc-level embedding formulation, insights from extensive ablations, and guidance for supervised fine-tuning via adaptive negative sampling and margin-based losses. The findings show substantial gains in retrieval accuracy and robustness, with practical implications for building more context-aware retrievers at scale, at the price of higher computational cost and a risk of hallucination in the synthetic data.

Abstract

Recently, embedding-based retrieval (dense retrieval) has shown state-of-the-art results compared with traditional sparse or bag-of-words approaches. This paper introduces a model-agnostic doc-level embedding framework based on large language model (LLM) augmentation. It also improves important components of the retrieval model training process, such as negative sampling and the loss function. By implementing this LLM-augmented retrieval framework, we significantly improve the effectiveness of widely used retriever models such as bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), achieving state-of-the-art results on the LoTTE and BEIR datasets.

Paper Structure (26 sections, 5 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Overview of the LLM-augmented retrieval framework. Synthetic relevant queries and synthetic titles are generated by an LLM and then assembled into a doc-level embedding together with chunks (passages) split from the original document. The final retrieval is based on the similarity between the user query and the doc-level embedding.
  • Figure 2: With synthetic relevant queries, the relevance relationship is expressed not only by embedding similarity but also by the augmentation steps of the large language model.
  • Figure 3: Graphical representation of "relevance" in the doc-level embedding.
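
The assembly described in the Figure 1 caption can be illustrated with a minimal sketch for the bi-encoder case. This is not the authors' formulation, which this summary does not reproduce. It assumes that embeddings are already computed, that each augmentation field (synthetic queries, titles) is summarized by the mean of its vectors, and that the document score is the best chunk similarity plus a weighted sum of field similarities. The names `doc_level_score`, `fields`, and `weights` and the weight values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def doc_level_score(
    query: Vector,
    chunks: Sequence[Vector],
    fields: Dict[str, Sequence[Vector]],
    weights: Dict[str, float],
) -> float:
    """Assumed scoring rule: best chunk cosine similarity plus weighted
    cosine similarity to the mean embedding of each augmentation field."""
    q = normalize(query)
    # Passage part: the single best-matching chunk of the original document.
    chunk_score = max(dot(q, normalize(c)) for c in chunks)
    # Augmentation part: one mean vector per field, scaled by its weight.
    field_score = 0.0
    for name, vectors in fields.items():
        if vectors:  # skip fields with no generated content
            field_vec = normalize(mean_vector(vectors))
            field_score += weights.get(name, 0.0) * dot(q, field_vec)
    return chunk_score + field_score


# Toy 2-dimensional embeddings; real ones would come from a retriever encoder.
query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.0, 1.0]]
fields = {"synthetic_queries": [[1.0, 0.0], [1.0, 0.0]], "title": [[0.0, 1.0]]}
weights = {"synthetic_queries": 0.5, "title": 0.25}  # hypothetical weights
score = doc_level_score(query, chunks, fields, weights)
```

Because the field vectors depend only on the document, they can be computed once at indexing time, so the extra work at query time is a few more dot products per document. The LLM generation step itself is the main added cost and is not shown here.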