Agent-based code generation for the Gammapy framework

Dmitriy Kostunin, Vladimir Sotnikov, Sergo Golovachev, Abhay Mehta, Tim Lukas Holch, Elisa Jones

TL;DR

The paper addresses the challenge of turning Large Language Model (LLM) guidance into reproducible Gammapy analysis scripts for gamma-ray astronomy. It introduces an agent-based system that writes, executes, and self-repairs Python scripts inside a sandbox, governed by strict prompting contracts, iterative validation, and optional retrieval-augmented generation (RAG) context. The architecture is modular (configuration, prompting, runner, execution, RAG) and is accompanied by a benchmarking harness, which shows high pass rates on common Gammapy tasks, and by a minimal web UI. The approach aims at reliable, auditable, and deployable AI-assisted analysis for DL3 workflows and supports open-weight backends for privacy and reproducibility.

Abstract

Software code generation using Large Language Models (LLMs) is one of the most successful applications of modern artificial intelligence. Foundational models are very effective for popular frameworks that benefit from documentation, examples, and strong community support. In contrast, specialized scientific libraries often lack these resources and may expose unstable APIs under active development, making it difficult for models trained on limited or outdated data to produce working code. We address these issues for the Gammapy library by developing an agent capable of writing, executing, and validating code in a controlled environment. We present a minimal web demo and an accompanying benchmarking suite. This contribution summarizes the design, reports our current status, and outlines next steps.
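The write, execute, and validate cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RunResult`, `run_script`, and `generate_and_repair` are hypothetical, the `generate` callable stands in for any LLM backend, and a temporary directory with a timeout is only a stand-in for a real sandbox. Validation here is reduced to "the script exits with code 0"; the actual agent may apply stricter checks.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool    # True if some attempt exited with code 0
    attempts: int   # number of generation attempts used
    script: str     # last script that was executed
    stderr: str     # stderr (or timeout note) of the last attempt


def run_script(script: str, timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Run a script as a separate process inside a fresh temporary directory.

    Assumption: this isolates only the working directory; it is not a
    security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "analysis.py"
        path.write_text(script)
        return subprocess.run(
            [sys.executable, str(path)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )


def generate_and_repair(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> RunResult:
    """Ask `generate` for a script, run it, and feed errors back until it passes."""
    feedback: Optional[str] = None
    script, stderr = "", ""
    for attempt in range(1, max_attempts + 1):
        # `generate` is a hypothetical LLM call: (task, previous error) -> code.
        script = generate(task, feedback)
        try:
            proc = run_script(script)
        except subprocess.TimeoutExpired:
            stderr = feedback = "Script timed out."
            continue
        if proc.returncode == 0:
            return RunResult(True, attempt, script, proc.stderr)
        # Keep the traceback so the next attempt can repair the script.
        stderr = feedback = proc.stderr
    return RunResult(False, max_attempts, script, stderr)
```

In such a design, the number of attempts needed per task is a natural benchmark quantity, since it can be recorded alongside the pass/fail outcome for each model.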

Paper Structure

This paper contains 17 sections, 2 figures.

Figures (2)

  • Figure 1: Left: Block diagram of the agent. Solid arrows form the generation–execution–validation loop; dashed arrows indicate retrieval of contextual snippets (tutorials, examples). Right: Screenshot of the Streamlit prototype (https://majestix-vm8.zeuthen.desy.de).
  • Figure 2: Coding benchmark results (attempts to pass and pass rates per task/model).