AgentRxiv: Towards Collaborative Autonomous Research

Samuel Schmidgall, Michael Moor

TL;DR

AgentRxiv presents a centralized, open-source framework enabling collaborative autonomous research by multiple LLM agent laboratories that upload and reuse research via a shared preprint server. The study demonstrates that access to prior agent work improves performance on MATH-500 and generalizes to GPQA, MMLU-Pro, and MedQA, with parallel laboratories accelerating discovery at higher compute cost. It analyzes the discovery and generalization of reasoning techniques (notably Simultaneous Divergence Averaging) across benchmarks and models, and discusses the trade-offs between parallelization and efficiency. The work also candidly addresses limitations such as hallucinations, failure modes, and ethical considerations, outlining future verification and safety enhancements necessary for responsible autonomous scientific progress.

Abstract

Progress in scientific discovery is rarely the result of a single "Eureka" moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv, a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other's research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH-500). We find that the best performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories, achieving higher overall accuracy (13.7% relative improvement over baseline on MATH-500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.
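
The upload-and-retrieve loop described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical and is not the AgentRxiv implementation: the `Paper` and `PreprintServer` names, the in-memory storage, and the word-overlap ranking are assumptions chosen only to show one laboratory publishing a report and another finding it by query.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """A report uploaded by one agent laboratory (hypothetical schema)."""
    lab: str
    title: str
    abstract: str


@dataclass
class PreprintServer:
    """Hypothetical in-memory stand-in for a shared preprint server."""
    papers: list[Paper] = field(default_factory=list)

    def upload(self, paper: Paper) -> None:
        # Make the report visible to every laboratory using this server.
        self.papers.append(paper)

    def search(self, query: str, top_k: int = 3) -> list[Paper]:
        # Assumed ranking: number of lowercase words shared with the query.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(f"{p.title} {p.abstract}".lower().split())), p)
            for p in self.papers
        ]
        hits = [(score, p) for score, p in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [p for _, p in hits[:top_k]]


# Laboratory #2 uploads a report; Laboratory #1 later retrieves it by query.
server = PreprintServer()
server.upload(Paper("lab-2", "Voting prompts for math reasoning",
                    "Sampling several answers and voting"))
server.upload(Paper("lab-3", "Retrieval notes for medical QA",
                    "An unrelated topic"))
results = server.search("math reasoning prompts")
print([p.title for p in results])
```

A real deployment would likely replace the word-overlap score with a stronger retrieval method; the sketch only fixes the shape of the interaction.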

Paper Structure

This paper contains 39 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Collaborative Autonomous Research via AgentRxiv. Distributed autonomous agent laboratories collaboratively pursue a shared research goal using AgentRxiv. Human researchers provide initial guidance through a research direction and detailed instructions. Agents autonomously perform research and upload research papers to the centralized AgentRxiv preprint server, enabling laboratories to access each other's discoveries and accelerating scientific progress.
  • Figure 2: Agent Laboratory Workflow. (Top) This image shows Agent Laboratory's three phases: Literature Review, Experimentation, and Report Writing. Human researchers collaborate with AI agents (e.g., PhD, Postdoc) and specialized tools (mle-solver, paper-solver) to automate tasks and produce high-quality research outputs. (Bottom) This
  • Figure 3: AgentRxiv Framework for Autonomous Research Collaboration. Depicted are two independent autonomous agent laboratories interacting through the centralized archival preprint server, AgentRxiv. (Left) Laboratory #1 submits a search query to AgentRxiv, retrieving relevant research papers published by other agent laboratories. (Right) Laboratory #2 completes and uploads its research findings to AgentRxiv, making the research accessible for retrieval and use by other autonomous laboratories. This workflow enables efficient knowledge sharing and iterative progress among independent agent systems.
  • Figure 4: Designing Novel Reasoning Techniques on MATH-500. Progression of a single autonomous laboratory iteratively designing reasoning techniques to improve accuracy on the MATH-500 benchmark using gpt-4o mini as the base model. Call-outs indicate the discovery of techniques that set a new highest accuracy on the test set. Techniques such as Progressive Confidence Cascade (PCC), Dynamic Critical Chain Prompting (DCCP), and Dual Anchor Cross-Verification Prompting (DACVP) incrementally increased accuracy from a baseline of 70.2% (gpt-4o mini zero-shot) up to 78.2% (+11.4%) with the final discovered method, Simultaneous Divergence Averaging (SDA).
  • Figure 5: Properties of autonomous discovery. A. The discovered algorithm, Simultaneous Divergence Averaging (SDA), demonstrates generality beyond its original discovery benchmark (MATH-500) to three distinct reasoning benchmarks (MedQA, MMLU-Pro, and GPQA). SDA (blue) consistently improves accuracy compared to 0-shot prompting (gray) across diverse tasks. B. Comparison of best accuracy obtained on MATH-500 when agents have access to previously generated research (green) versus no access (pink). Agents referencing prior research consistently achieve higher performance, indicating the value of cumulative knowledge integration. C. The discovered SDA algorithm generalizes effectively across multiple language models (gpt-4o mini, gpt-4o, DeepSeek v3, Gemini-1.5-Pro, Gemini-2.0-Flash) and across several reasoning benchmarks. SDA (blue) demonstrates higher average accuracy compared to 0-shot prompting (gray).
  • ...and 1 more figure
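
The Figure 4 caption reports accuracy rising from 70.2% to 78.2% and labels the gain "+11.4%". That figure is a relative improvement over the baseline, not a difference in percentage points. The short check below makes the distinction explicit; the function and variable names are illustrative only.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over the baseline, expressed in percent."""
    return (improved - baseline) / baseline * 100.0


baseline_acc = 70.2  # gpt-4o mini zero-shot on MATH-500 (Figure 4 caption)
sda_acc = 78.2       # accuracy with SDA (Figure 4 caption)

absolute_gain = sda_acc - baseline_acc
relative_gain = relative_improvement(baseline_acc, sda_acc)

print(f"absolute: +{absolute_gain:.1f} points, relative: +{relative_gain:.1f}%")
# prints "absolute: +8.0 points, relative: +11.4%"
```

The 8.0-point absolute gain and the 11.4% relative gain describe the same result, so reports of improvement should state which convention they use.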