PaSa: An LLM Agent for Comprehensive Academic Paper Search
Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
TL;DR
PaSa is a two-agent LLM system for comprehensive academic paper search: a Crawler explores search results and citation networks, and a Selector checks each collected paper for relevance to the query, all trained within the AGILE reinforcement learning framework. The paper introduces the AutoScholarQuery and RealScholarQuery datasets to train and evaluate the agent, and shows that a 7B-parameter PaSa model surpasses Google-based and GPT-based baselines on real-world queries. The work demonstrates the value of agentic search with long trajectories and sparse rewards, reporting strong recall improvements supported by ablation studies, while acknowledging limitations such as field specificity and model scale. Overall, PaSa advances automated, thorough literature surveying by integrating search, reading, and citation navigation into an end-to-end RL-trained system.
Abstract
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa's performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
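
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how recall@k, recall, and precision are commonly computed for a single query. It assumes the standard set-based definitions and is not taken from the PaSa codebase; the function names and the toy paper identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant papers found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def recall(returned_ids, relevant_ids):
    """Fraction of relevant papers found anywhere in the returned set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(returned_ids) & relevant) / len(relevant)


def precision(returned_ids, relevant_ids):
    """Fraction of returned papers that are relevant."""
    returned = set(returned_ids)
    if not returned:
        return 0.0
    return len(returned & set(relevant_ids)) / len(returned)


# Toy example with hypothetical paper identifiers.
ranked = ["p1", "p2", "p3", "p4"]
gold = {"p1", "p4", "p9"}

r_at_2 = recall_at_k(ranked, gold, 2)  # only p1 is in the top 2
r_all = recall(ranked, gold)           # p1 and p4 are returned, p9 is missed
p_all = precision(ranked, gold)        # 2 of the 4 returned papers are relevant
```

Rank-limited metrics such as recall@20 and recall@50 suit search engines that return an ordered list, whereas an agent that returns an unordered set of papers is more naturally scored with plain recall and precision, which is how the abstract reports the comparison with PaSa-GPT-4o.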
