DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning

Pengcheng Jiang; Jiacheng Lin; Lang Cao; Runchu Tian; SeongKu Kang; Zifeng Wang; Jimeng Sun; Jiawei Han

TL;DR

The paper addresses the cost and data demands of improving information retrieval by introducing DeepRetrieval, an RL-based framework that trains LLMs to generate augmented queries without supervised reference queries. It uses retrieval metrics as rewards, optimized with PPO, together with a reasoning-enhanced generation pipeline, so training directly targets end-task performance across literature search, evidence-seeking retrieval, classic IR, and SQL database search. On real search engines, recall reaches 65.07% for publication search (previous SOTA 24.68%) and 63.18% for trial search (previous SOTA 32.11%), and a 3B-parameter model outperforms GPT-4o and Claude-3.5-Sonnet on 11 of 13 datasets. The method remains effective across retrievers and domains, and the analysis reveals nuanced phenomena such as dataset-specific knowledge injection and evolving reasoning needs. These findings suggest a practical path toward more efficient, adaptable IR systems that can leverage existing retrievers without expensive supervision or distillation.

Abstract

Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference queries). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval, and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.
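
To make "retrieval metrics as rewards" concrete, the following is a minimal sketch of how a generated query could be scored by recall, with no reference query involved. It is an illustration under stated assumptions, not the authors' implementation: the names `recall_at_k`, `query_reward`, and `toy_search`, the in-memory corpus, and the cutoff `k` are all hypothetical, and the paper's actual reward may include further terms that are not described in this section.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# Hypothetical toy corpus standing in for a real search engine index.
CORPUS: Dict[str, str] = {
    "d1": "statin therapy cardiovascular risk",
    "d2": "statin muscle pain adverse events",
    "d3": "aspirin stroke prevention",
}


def toy_search(query: str) -> List[str]:
    """Assumed stand-in retriever: OR-match on whitespace-separated terms,
    ranked by number of matched terms (ties broken by document id)."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in CORPUS.items():
        matches = len(terms & set(text.split()))
        if matches > 0:
            scored.append((-matches, doc_id))
    return [doc_id for _, doc_id in sorted(scored)]


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


def query_reward(
    query: str,
    search_fn: Callable[[str], Sequence[str]],
    relevant: Iterable[str],
    k: int = 10,
) -> float:
    """Score a generated query by the recall it achieves; only the set of
    relevant documents is needed, not a gold query."""
    return recall_at_k(search_fn(query), relevant, k)


if __name__ == "__main__":
    relevant_docs = {"d1", "d2"}
    for candidate in ["aspirin", "statin"]:
        print(candidate, query_reward(candidate, toy_search, relevant_docs, k=2))
```

In an RL loop, a scalar like this would be returned for each query the policy samples, so queries that retrieve more of the relevant documents are reinforced. How the reward is shaped and fed to PPO is left open here.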
