AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak, Dolores Garcia, Philip Harris

Abstract

Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof-of-concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature-based knowledge retrieval and multi-agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.

Paper Structure (531 sections, 99 equations, 139 figures, 3 tables)

Figures (139)

  • Figure 1: Diagram of how an AI-agent workflow can mirror the typical high-energy physics analysis workflow. On the left, we show the typical analysis pipeline, which usually starts with legacy code that is then modified to perform the analysis. Analyses typically involve three or more levels of review, starting with feedback from other postdocs, students, and faculty collaborating on the analysis (office feedback). The next tier is typically done in an analysis subgroup (colloquially referred to as a level 3 group). The following tier involves a pre-approval phase, in which physics group conveners review the analysis, followed by a formal collaboration review leading to a result and a submission for publication. On the right, an equivalent interactive workflow can be handled entirely by AI agents, from the conception of an idea through a result that would then undergo a similar collaboration review, followed by publication.
  • Figure 2: The JFC framework. A high-level physics objective is passed to an autonomous analysis agent, which plans and executes the full pipeline while querying a literature retrieval system (SciTreeRAG) for domain knowledge. The resulting analysis undergoes multi-agent review; if any reviewer flags an issue, the agent revises and resubmits until all reviewers approve. The final document (a rich analysis note) is then passed to human physicists for evaluation.
  • Figure 3: Energy scan structure showing the number of selected events at each centre-of-mass energy point, colored by data-taking year. The five energy groups used in the lineshape fit are indicated by the dashed vertical bands. The dominant contributions at the off-peak points come from the 1993 and 1995 energy scans, while the peak region receives events from all four years. The "above peak" group near 91.7 GeV contains only 2,380 events from 1995.
  • Figure 4: Energy distributions for each data-taking year. The 1992 and 1994 (P1, P2, P3) datasets cluster tightly around the peak energy, while 1993 and 1995 show the characteristic LEP energy scan pattern with measurements at the peak and approximately ±2 GeV off-peak. The 1995 dataset additionally includes a small sample near 91.7 GeV. The distinct energy coverage of different years motivates the two-tier luminosity strategy described in Sec. \ref{zls:sec:luminosity}.
  • Figure 5: Distribution of charged hadron multiplicity ($n_{ch}$) after the hadronic event selection. Data (black points with statistical error bars) are compared to MC simulation (blue filled histogram) normalized to the same number of events. The mean multiplicity is approximately 20, characteristic of hadronic Z decays. The agreement between data and MC is excellent across the full range, with deviations below 2%. The low-multiplicity tail ($n_{ch} < 10$) is sensitive to two-photon and $\tau^+\tau^-$ backgrounds.
  • ...and 134 more figures
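
The revise-and-resubmit loop described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the JFC implementation. The names `Review`, `review_loop`, `revise`, and the toy reviewers are all hypothetical, and the `max_rounds` cap is an added assumption (the caption only states that the agent resubmits until all reviewers approve). Reviewers here are plain functions standing in for review agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Verdict from one (hypothetical) reviewer."""
    reviewer: str
    approved: bool
    comment: str = ""


# Hypothetical interfaces: a reviewer inspects a draft; a reviser
# produces a new draft from the current one and the flagged reviews.
Reviewer = Callable[[str], Review]
Reviser = Callable[[str, List[Review]], str]


def review_loop(draft: str, reviewers: List[Reviewer], revise: Reviser,
                max_rounds: int = 5) -> Tuple[str, int]:
    """Resubmit the draft until every reviewer approves.

    max_rounds is an assumed safety cap, not something the source specifies.
    Returns the approved draft and the number of review rounds used.
    """
    for round_number in range(1, max_rounds + 1):
        reviews = [reviewer(draft) for reviewer in reviewers]
        flagged = [r for r in reviews if not r.approved]
        if not flagged:
            return draft, round_number
        draft = revise(draft, flagged)
    raise RuntimeError(f"no unanimous approval after {max_rounds} rounds")


# Toy stand-ins for review agents: each checks for a placeholder tag.
def units_reviewer(draft: str) -> Review:
    ok = "[units]" in draft
    return Review("units", ok, "" if ok else "state the units")


def uncertainty_reviewer(draft: str) -> Review:
    ok = "[uncertainty]" in draft
    return Review("uncertainty", ok, "" if ok else "quote an uncertainty")


def toy_revise(draft: str, flagged: List[Review]) -> str:
    # Address each flagged issue by appending the missing placeholder tag.
    for review in flagged:
        draft += f" [{review.reviewer}]"
    return draft


final, rounds = review_loop("draft note",
                            [units_reviewer, uncertainty_reviewer],
                            toy_revise)
```

In this toy run both reviewers flag the first draft, the reviser addresses both comments, and the second round passes unanimously. A real system would replace the functions with calls to separate agents, and would hand the approved note to human physicists as the caption describes.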