PaSa: An LLM Agent for Comprehensive Academic Paper Search

Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

TL;DR

PaSa is a two-agent LLM system for comprehensive academic search: a Crawler explores search results and citation networks, and a Selector checks each collected paper for relevance. Both are trained within the AGILE reinforcement learning framework. The work introduces the AutoScholarQuery and RealScholarQuery datasets to train and evaluate the agent, and shows that a 7B-parameter PaSa model surpasses Google-based and GPT-based baselines on real-world queries. It demonstrates the value of agentic search with long trajectories and sparse rewards, reporting strong recall improvements supported by ablation studies, while acknowledging limitations such as field specificity and model scale. Overall, PaSa advances automated, thorough literature surveying by integrating search, reading, and citation navigation into an end-to-end RL-trained system.

Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
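The abstract reports results as recall@20, recall@50, recall, and precision. This summary does not spell out how the paper computes them, so the following is only a minimal sketch of the standard definitions, assuming papers are identified by unique string IDs. The function names `recall_at_k`, `recall`, and `precision` are illustrative and are not taken from the PaSa codebase.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant papers that appear among the top-k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # assumption: define recall as 0 when there is no ground truth
    top_k = set(ranked[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def recall(returned: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the relevant papers found anywhere in the returned set."""
    returned_list = list(returned)
    return recall_at_k(returned_list, relevant, len(returned_list))


def precision(returned: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the returned papers that are relevant."""
    returned_set = set(returned)
    if not returned_set:
        return 0.0  # assumption: define precision as 0 for an empty result list
    return len(returned_set & set(relevant)) / len(returned_set)
```

Recall@k suits ranked baselines such as a search engine, where only the first k hits are inspected, while plain recall and precision suit an agent that returns an unranked set of selected papers.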

Paper Structure

This paper contains 39 sections, 8 equations, 3 figures, and 20 tables.

Figures (3)

  • Figure 1: Architecture of PaSa. The system consists of two LLM agents, Crawler and Selector. The Crawler processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool, expand citations, or stop processing the current paper. All papers collected by the Crawler are appended to the paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in the user query.
  • Figure 2: An example of the PaSa workflow. The Crawler issues multiple [Search] actions using diverse and complementary queries. In addition, the Crawler can evaluate the long-term value of its actions. Notably, it discovers many relevant papers as it explores deeper in the citation network, even when intermediate papers along the path do not align with the user query.
  • Figure 3: Return and value function loss curves during the PPO training process. The curves are smoothed with an exponential moving average (EMA), using the same formula as TensorBoard, with a smoothing weight of 0.95.
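The queue-based control flow described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `crawl` and `select` are hypothetical callables standing in for the two LLM agents, papers are represented by plain string IDs, and the `max_papers` budget is an added assumption to keep the loop bounded.

```python
from collections import deque
from typing import Callable, List, Optional

# Hypothetical agent interfaces (assumptions, not the PaSa API):
#   crawl(query, paper)  -> paper IDs found by searching or expanding citations;
#                           paper is None for the initial call on the user query,
#                           and an empty list means "stop" for that paper.
#   select(query, paper) -> True if the paper meets the criteria in the query.
CrawlFn = Callable[[str, Optional[str]], List[str]]
SelectFn = Callable[[str, str], bool]


def run_search(query: str, crawl: CrawlFn, select: SelectFn,
               max_papers: int = 100) -> List[str]:
    queue: deque = deque()   # the paper queue shared by both agents
    seen: set = set()        # avoid enqueueing the same paper twice
    selected: List[str] = []

    def enqueue(papers: List[str]) -> None:
        # Everything the Crawler collects is appended to the paper queue.
        for paper in papers:
            if paper not in seen and len(seen) < max_papers:
                seen.add(paper)
                queue.append(paper)

    enqueue(crawl(query, None))          # Crawler starts from the user query
    while queue:
        paper = queue.popleft()
        if select(query, paper):         # Selector judges each queued paper
            selected.append(paper)
        enqueue(crawl(query, paper))     # Crawler may search or expand citations
    return selected
```

Keeping collection and selection separate means a paper that fails the Selector can still be expanded by the Crawler, which matches the Figure 2 observation that relevant papers are sometimes reached through intermediate papers that do not match the query.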
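For readers who want to reproduce the kind of smoothing mentioned in the Figure 3 caption, here is a minimal sketch of an exponential moving average with weight 0.95. It assumes the simple recurrence seeded with the first data point; TensorBoard's own implementation may differ in details such as bias correction, which is not reproduced here. The function name `ema_smooth` is illustrative.

```python
from typing import List, Sequence


def ema_smooth(values: Sequence[float], weight: float = 0.95) -> List[float]:
    """Exponential moving average: s_t = weight * s_{t-1} + (1 - weight) * x_t.

    Assumption: the first smoothed value equals the first raw value.
    """
    smoothed: List[float] = []
    last = None
    for value in values:
        if last is None:
            last = value  # seed with the first observation
        else:
            last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed
```

A weight closer to 1 gives a smoother but more lagged curve, so with 0.95 each plotted point reflects a fairly long window of recent training steps.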