Large Language Models are Built-in Autoregressive Search Engines

Noah Ziems, Wenhao Yu, Zhihan Zhang, Meng Jiang

TL;DR

The paper investigates whether large language models can serve as built-in autoregressive search engines by directly generating Web URLs for document retrieval. By using in-context demonstrations, LLMs can produce valid URLs whose linked documents often contain answer spans, enabling a retrieval approach that rivals or surpasses traditional retrievers on open-domain QA benchmarks. The method leverages existing Internet identifiers (Wikipedia URLs) and a lightweight passage filtering step to feed a reader, achieving substantial improvements in Recall@k and downstream EM on WebQ, NQ, and TriviaQA, including time-sensitive retrieval advantages. Limitations include sensitivity to knowledge updates and potential hallucinations, motivating future work on prompt design and diverse demonstrations.

Abstract

Document retrieval is a key stage of standard Web search engines. Existing dual-encoder dense retrievers obtain representations for questions and documents independently, allowing for only shallow interactions between them. To overcome this limitation, recent autoregressive search engines replace the dual-encoder architecture by directly generating identifiers for relevant documents in the candidate pool. However, the training cost of such autoregressive search engines rises sharply as the number of candidate documents increases. In this paper, we find that large language models (LLMs) can follow human instructions to directly generate URLs for document retrieval. Surprisingly, when providing a few Query-URL pairs as in-context demonstrations, LLMs can generate Web URLs where nearly 90% of the corresponding documents contain correct answers to open-domain questions. In this way, LLMs can be thought of as built-in search engines, since they have not been explicitly trained to map questions to document identifiers. Experiments demonstrate that our method can consistently achieve better retrieval performance than existing retrieval approaches by a significant margin on three open-domain question answering benchmarks, under both zero-shot and few-shot settings. The code for this work can be found at https://github.com/Ziems/llm-url.
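
To make the idea of Query-URL in-context demonstrations concrete, here is a minimal sketch of how such a few-shot prompt might be assembled. The function name `build_url_prompt`, the `Question:`/`URL:` template, and the example pairs are illustrative assumptions, not the prompt format used by the authors.

```python
def build_url_prompt(question, demonstrations):
    """Assemble a few-shot prompt from (question, url) demonstration pairs.

    The "Question:"/"URL:" template is a hypothetical format chosen for
    illustration; the paper's actual prompt wording is not reproduced here.
    """
    lines = []
    for demo_question, demo_url in demonstrations:
        lines.append(f"Question: {demo_question}")
        lines.append(f"URL: {demo_url}")
        lines.append("")  # blank line between demonstrations
    # The final question is left open so the model completes the URL.
    lines.append(f"Question: {question}")
    lines.append("URL:")
    return "\n".join(lines)


# Illustrative demonstration pairs (assumed, not taken from the paper).
demos = [
    ("Who wrote Hamlet?", "https://en.wikipedia.org/wiki/Hamlet"),
    ("What is the capital of France?", "https://en.wikipedia.org/wiki/Paris"),
]
prompt = build_url_prompt("Who discovered radium?", demos)
```

Passing an empty demonstration list yields the zero-shot form of the same prompt.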

Paper Structure (26 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Successful URL reconstructions by GPT-3 models of different sizes used as URL generators. The models are prompted with the first 100 words of a Wikipedia page and then tasked with generating the URL of the page the text came from. Tested on 10k Wikipedia pages sampled from the top 100k most frequent Wikipedia entities.
  • Figure 2: The overall pipeline of our proposed LLM-URL. Given a question, LLM-URL first generates text from which a set of URLs is extracted. The documents at these URLs are retrieved from the Internet and broken into passages, which are ranked and filtered so that only the most relevant are kept. Finally, these passages are given as input to a reader model along with the original question to generate a final answer.
  • Figure 3: We prompt LLM-URL to generate $m$ documents and measure the recall as $m$ increases. Recall improves significantly when $m$ is small, but the marginal benefit decreases as $m$ grows.
  • Figure 4: The percentage of valid URLs generated from LLM-URL as the total number of generated URLs $m$ increases from 1 to 10. As $m$ increases, invalid URL generations become more frequent.
  • Figure 5: Recall@1 for common vs. uncommon entities. A question is considered common if it contains an entity among the top 1 million most common entities on Wikipedia. LLM-URL performs much better when retrieving information about common entities.
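
The pipeline stages described in the Figure 2 caption (extract URLs from generated text, split documents into passages, keep the most relevant passages) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: all function names are hypothetical, the 100-word passage length is an arbitrary choice, and the term-overlap score is a simple stand-in for whatever ranking function the authors use. Fetching the pages and running the reader model are omitted.

```python
import re
from collections import Counter

# Assumed pattern: a URL runs until whitespace, a quote, or an angle bracket.
# Parentheses are kept because Wikipedia titles may contain them.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(generated_text):
    """Pull URL-like strings out of free-form model output, in first-seen order."""
    seen, urls = set(), []
    for match in URL_PATTERN.findall(generated_text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def split_into_passages(document, words_per_passage=100):
    """Break a document into consecutive fixed-size word windows."""
    words = document.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def overlap_score(question, passage):
    """Count question terms that also occur in the passage (stand-in ranker)."""
    q_terms = Counter(re.findall(r"\w+", question.lower()))
    p_terms = Counter(re.findall(r"\w+", passage.lower()))
    return sum(min(count, p_terms[term]) for term, count in q_terms.items())


def top_passages(question, passages, keep=2):
    """Keep the `keep` passages with the highest overlap with the question."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:keep]


# Illustrative model output (assumed, not a real generation).
generated = (
    "1. https://en.wikipedia.org/wiki/Marie_Curie\n"
    "2. https://en.wikipedia.org/wiki/Radium."
)
urls = extract_urls(generated)
```

In a full system, the passages returned by `top_passages` would be concatenated with the question and passed to the reader model.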