Corpus-Steered Query Expansion with Large Language Models

Yibin Lei, Yu Cao, Tianyi Zhou, Tao Shen, Andrew Yates

TL;DR

Inspired by Pseudo Relevance Feedback, Corpus-Steered Query Expansion (CSQE) is introduced to promote the incorporation of knowledge embedded within the corpus.

Abstract

Recent studies demonstrate that query expansions generated by large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.
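
The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the retriever is a toy term-overlap scorer, the two LLM steps are replaced by hypothetical stand-in functions (`fake_extract`, `fake_generate`), and the expanded query is formed by plain concatenation. All function names and the three-document corpus are invented for the example.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, doc: str) -> int:
    """Toy lexical score (assumption): product of term counts, summed over query terms."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(q[t] * d[t] for t in q)


def retrieve(query: str, corpus: List[str], k: int) -> List[int]:
    """Return indices of the k best-scoring documents (ties keep corpus order)."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: overlap_score(query, corpus[i]),
                    reverse=True)
    return ranked[:k]


def csqe_expand(query: str,
                corpus: List[str],
                extract_key_sentences: Callable[[str, List[str]], List[str]],
                generate_hypothetical_docs: Callable[[str], List[str]],
                k: int = 2) -> str:
    """Expand the query with corpus-originated sentences and LLM-generated text."""
    top_docs = [corpus[i] for i in retrieve(query, corpus, k)]  # initial retrieval
    corpus_texts = extract_key_sentences(query, top_docs)       # LLM step 1 (stubbed)
    llm_texts = generate_hypothetical_docs(query)               # LLM step 2 (stubbed)
    # Plain concatenation is an assumption of this sketch.
    return " ".join([query, *corpus_texts, *llm_texts])


def fake_extract(query: str, docs: List[str]) -> List[str]:
    """Stand-in for the LLM relevance judgment: keep sentences sharing a query token."""
    q = set(tokenize(query))
    kept = []
    for doc in docs:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if q & set(tokenize(sentence)):
                kept.append(sentence)
    return kept


def fake_generate(query: str) -> List[str]:
    """Stand-in for an LLM writing a hypothetical answer document."""
    return [f"A hypothetical passage answering: {query}"]


corpus = [
    "Biology is the study of living organisms. It covers cells, genetics and evolution.",
    "The stock market fell sharply today.",
    "Genetics explains heredity in living organisms.",
]
query = "biology definition"

initial = retrieve(query, corpus, 2)
expanded = csqe_expand(query, corpus, fake_extract, fake_generate, k=2)
final = retrieve(expanded, corpus, 2)
print(initial, final)
```

In this toy run the original query matches only the first document, so the second slot of the initial ranking is filled arbitrarily. After expansion, the sentence extracted from the first document contributes the terms "living" and "organisms", which lifts the third document into the top 2. With a real LLM, the two stand-in functions would be replaced by prompted calls, and the scorer by an actual retriever.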

Paper Structure (18 sections, 1 figure, 10 tables)

Figures (1)

  • Figure 1: Overview of CSQE. Given the query "Biology definition" and the top-2 retrieved documents, CSQE uses an LLM to identify document 1 as relevant and to extract the key sentences from it that contribute to that relevance. The query is then expanded with both these corpus-originated texts and LLM-knowledge empowered expansions (i.e., hypothetical documents that answer the query), and the expanded query is used to obtain the final results.