ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Ruofeng Yang, Yongcan Li, Shuai Li

Abstract

This report describes ARIS (Auto-Research-in-sleep), an open-source harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. We therefore present ARIS as a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as its default configuration: an executor model drives forward progress while a reviewer from a different model family critiques intermediate artifacts and requests revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence. The assurance layer also provides a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.
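
To make the three-stage assurance process concrete, the following is a minimal sketch under stated assumptions: the data shapes, the `verify_integrity`, `build_claim_ledger`, and `audit_claims` names, and the reviewer's `supports` call are all hypothetical, since the report does not specify ARIS's internal API.

```python
import os
from dataclasses import dataclass

# Hypothetical data shape; not ARIS's actual claim representation.
@dataclass
class Claim:
    claim_id: str
    text: str                  # statement as written in the manuscript
    evidence_paths: list[str]  # raw artifacts (logs, metric dumps) it cites

def verify_integrity(claim: Claim) -> bool:
    """Stage 1: every cited artifact must exist and be non-empty."""
    return bool(claim.evidence_paths) and all(
        os.path.isfile(p) and os.path.getsize(p) > 0
        for p in claim.evidence_paths
    )

def build_claim_ledger(claims: list[Claim]) -> dict[str, list[str]]:
    """Stage 2: result-to-claim mapping, recorded as a claim ledger."""
    return {c.claim_id: c.evidence_paths for c in claims}

def audit_claims(claims: list[Claim], ledger: dict[str, list[str]],
                 reviewer) -> list[str]:
    """Stage 3: cross-check each manuscript statement against its ledger
    entry and the raw evidence it points to. `reviewer.supports` is a
    stand-in for a cross-model judgment call, not a real ARIS interface."""
    unsupported = []
    for c in claims:
        evidence = [open(p, encoding="utf-8").read() for p in ledger[c.claim_id]]
        if not reviewer.supports(c.text, evidence):
            unsupported.append(c.claim_id)
    return unsupported
```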


Paper Structure

This paper contains 56 sections, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: ARIS workflow library. Top: end-to-end composition of the five workflows and their artifact contracts, grouped into four research phases (Discovery, Experimentation, Manuscript, Post-Submission); dashed links denote reviewer feedback, GPU-triggered evidence collection, and wiki memory. Bottom: compressed internal structure for the workflows not otherwise expanded in the main text: W1 idea discovery (with reviewer-gated refinement), W1.5 experiment bridge (with code review and auto-debug fallback), and W4 rebuttal (with safety gates and stress test). The internals of W2 auto-review and W3 paper writing are detailed separately in Figures 2 and 3.
  • Figure 2: Workflow 2: Auto Review Loop. Each round submits the draft to a cross-model reviewer for structured scoring, extracts action items, optionally runs GPU experiments to gather new evidence, revises the affected sections, and checks convergence. The loop terminates when the score exceeds a predefined threshold or after a preset maximum number of rounds (a control-flow sketch follows this figure list).
  • Figure 3: Workflow 3: Paper Writing Pipeline. Three phases: Plan & Generate (outline, figures), Draft & Assure (LaTeX drafting with five-pass editing, optional proof checking, claim auditing), and Compile & Improve (compilation, two rounds of GPT-5.4 xhigh visual review with automatic revision).
  • Figure 4: ARIS system topology. Six component groups interact through labeled relationships (left margin): the Meta-Optimization outer loop gates the Assurance layer, which checks Artifacts; artifacts are produced and consumed by Workflows, which orchestrate Skills; skills call MCP & Tool Bridges for external model and data access. The executor and reviewer (right) use models from different families. ARIS-Code CLI bundles all components into a standalone binary.
  • Figure 5: Cross-model adversarial collaboration alternates executor generation with external-model critique, actionable revision requests, and convergence checking. Reviewer access ranges from document-only to repository-level.
  • ...and 7 more figures
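
The convergence logic of the Workflow 2 loop in Figure 2 reduces to a short control loop. Below is a minimal sketch, assuming a reviewer callable that returns a numeric score plus action items; the names (`auto_review_loop`, `report.score`, `report.action_items`) and the default threshold and round cap are illustrative placeholders, not ARIS's actual interface or settings.

```python
def auto_review_loop(draft, review, revise, threshold=8.0, max_rounds=5):
    """Hypothetical rendering of the Workflow 2 loop in Figure 2.

    `review` and `revise` stand in for the cross-model reviewer and the
    executor's revision step; neither is a real ARIS entry point.
    """
    for _ in range(max_rounds):
        report = review(draft)         # structured scoring by a reviewer
                                       # from a different model family
        if report.score >= threshold:  # convergence: score clears the bar
            break
        # Per Figure 2, new GPU experiments may run here to collect fresh
        # evidence before revision (omitted in this sketch).
        draft = revise(draft, report.action_items)  # revise affected sections
    return draft
```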

Theorems & Definitions (1)

  • Remark 1: Discussion of Human in the Loop