
Agents of Discovery

Sascha Diefenbacher, Anna Hallin, Gregor Kasieczka, Michael Krämer, Anne Lauscher, Tim Lukas

TL;DR

This work investigates a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries and by building on results of previous iterations.

Abstract

The substantial data volumes encountered in modern particle physics and other domains of fundamental physics research allow (and require) the use of increasingly complex data analysis tools and workflows. While the use of machine learning (ML) tools for data analysis has recently proliferated, these tools are typically special-purpose algorithms that rely, for example, on encoded physics knowledge to reach optimal performance. In this work, we investigate a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents -- instances of LLMs with specific subtasks -- that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries (including ML systems) and by building on results of previous iterations. If successful, such agent-based systems could be deployed to automate routine analysis components to counteract the increasing complexity of modern tool chains. To investigate the capabilities of current-generation commercial LLMs, we consider the task of anomaly detection via the publicly available and highly studied LHC Olympics dataset. Several current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated and their stability tested. Overall, we observe the capacity of the agent-based system to solve this data analysis problem. The best agent-created solutions mirror the performance of human state-of-the-art results.

Paper Structure

This paper contains 79 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Sketch of the agentic framework. The main agent is the Researcher, which orchestrates the project. It can communicate with the Coder and the Logic reviewer through the use of tools. The Coder additionally has access to a Code reviewer agent. The Researcher handles its tasks via a Task manager. All code is run on the user's local machine, and no agent has direct access to the raw data. A bar with two outgoing arrows means that both things the arrows point to happen; it is a fork rather than a decision point. The Local Machine contains shared services that may be accessed by different agents.
  • Figure 2: Graph showing the dependencies of the different prompts. Boxes refer to a singular prompt while ellipses refer to a family of prompts; see Appendix \ref{app:prompts} for details.
  • Figure 3: Comparison of four OpenAI models with the ML prompt across different high-level behavior metrics: number of calls, response time, completion time, input tokens, output tokens and total cost (see Appendix \ref{app:metrics} for the exact definition of these quantities). Only successful runs are shown. The mean is marked with a line and the one-standard-deviation range with a shaded box.
  • Figure 4: Comparison of four OpenAI models with the ML prompt across different metrics related to coding: execution time, execute python errors, lint errors, number of different coders used, tool calls handoff to coder and failed reviews (see Appendix \ref{app:metrics} for the exact definition of these quantities). Only successful runs are shown. The mean is marked with a line and the one-standard-deviation range with a shaded box.
  • Figure 5: Reported values of $m_{res}$, p-value and signal percentage on the LHCO R&D dataset, plus the max SIC calculated after the agent submitted its scores and ended its run. Only successful runs are shown. The mean is marked with a line and the one-standard-deviation range with a shaded box. The ML prompt is used in all runs.
  • ...and 9 more figures
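
The max SIC quoted in Figure 5 can be illustrated with a short sketch. It assumes the definition commonly used in the LHC Olympics literature, where the significance improvement characteristic is SIC = TPR / sqrt(FPR), maximized over score thresholds. The paper's own definition is given in its metrics appendix and is not reproduced here. The function name `max_sic` and the `min_fpr` cutoff are hypothetical choices for this illustration, not taken from the paper.

```python
import numpy as np


def max_sic(scores, labels, min_fpr=1e-4):
    """Maximum of SIC = TPR / sqrt(FPR) over all score thresholds.

    scores: anomaly scores, higher means more signal-like.
    labels: truth labels, 1 for signal and 0 for background.
    min_fpr: assumed cutoff that skips thresholds with very little
        background, where the ratio is dominated by statistical noise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Sort events from most to least anomalous.
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    is_signal = labels[order]

    # Cumulative signal and background counts passing each threshold.
    tp = np.cumsum(is_signal)
    fp = np.cumsum(~is_signal)
    tpr = tp / max(is_signal.sum(), 1)
    fpr = fp / max((~is_signal).sum(), 1)

    # A threshold is only meaningful after the last event of a tied group.
    distinct = np.r_[np.diff(sorted_scores) != 0, True]
    valid = distinct & (fpr >= min_fpr)
    if not valid.any():
        return 0.0
    return float(np.max(tpr[valid] / np.sqrt(fpr[valid])))


# Toy usage: two signal events ranked above six background events.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0]
toy_result = max_sic(toy_scores, toy_labels)
```

Under this definition, a classifier that cannot separate signal from background gives a max SIC of 1, so values above 1 indicate that a cut on the score would improve the significance of the signal.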