Artisan: Agentic Artifact Evaluation

Doehyun Baek, Michael Pradel

TL;DR

Artisan addresses the scalability and granularity limits of manual artifact evaluation by introducing an automated LLM-based agent that generates executable reproduction scripts to reproduce results from software engineering papers. It reframes reproduction as a code-generation task and couples it with a two-tier judging mechanism to ensure both correct outputs and meaningful reproduction methods, avoiding shortcuts like copied results. The authors contribute Artisan, the first automated artifact evaluation approach, and Artisan-Bench, a benchmark with 60 tasks from 23 papers, demonstrating that Artisan reproduces 44/60 cases and outperforms baselines by about 3.14× while revealing 20 paper-artifact inconsistencies. The work advances reproducibility in SE by enabling scalable, pre-submission, and continual evaluation of artifacts, with public release of code and data. The findings highlight practical benefits and limitations, including categories of failures and the importance of robust README guidance and artifact quality.

Abstract

Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive and hence done only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions. First, we frame the reproduction problem as a code generation task where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine-grained level. Second, we design an automated judging mechanism that guides the agent toward the expected results without revealing them and that prevents trivial solutions, such as simply copying checked-in results. To evaluate Artisan, we introduce Artisan-Bench, the first benchmark assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan-Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan-Bench for reproducibility to ensure that the tasks are feasible. Our experiments show that Artisan is effective, producing 44/60 reproduction scripts and outperforming the best available baseline, a vanilla LLM agent (mini-swe-agent), by 3.14× in terms of reproduction scripts generated, while costing $0.45 and taking 48 minutes per task on average. Artisan also helped uncover 20 new errors in either the paper or artifact.
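To make the two judging concerns in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that results can be represented as named numeric cells, that a relative tolerance is an acceptable notion of "matching", and that a plain substring check can stand in for detecting copied checked-in results. The names `Verdict`, `judge_output`, `judge_method`, and the 5% tolerance are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool
    feedback: str  # names cells or files only, never the expected values


def judge_output(reported: dict[str, float],
                 reproduced: dict[str, float],
                 rel_tol: float = 0.05) -> Verdict:
    """Check reproduced cells against reported ones (assumed 5% tolerance).

    The feedback lists which cells are missing or mismatched so an agent
    can keep working, but it does not reveal the reported numbers.
    """
    missing = sorted(k for k in reported if k not in reproduced)
    mismatched = sorted(
        k for k in reported
        if k in reproduced
        and not math.isclose(reproduced[k], reported[k], rel_tol=rel_tol)
    )
    if not missing and not mismatched:
        return Verdict(True, "all cells match")
    parts = []
    if missing:
        parts.append("missing cells: " + ", ".join(missing))
    if mismatched:
        parts.append("mismatched cells: " + ", ".join(mismatched))
    return Verdict(False, "; ".join(parts))


def judge_method(script_text: str,
                 checked_in_result_files: list[str]) -> Verdict:
    """Crude stand-in for a method check: flag scripts that mention
    checked-in result files instead of recomputing the results."""
    copied = sorted(f for f in checked_in_result_files if f in script_text)
    if copied:
        return Verdict(False, "script reads checked-in results: " + ", ".join(copied))
    return Verdict(True, "no checked-in results referenced")


if __name__ == "__main__":
    # Hypothetical cell names and values, for illustration only.
    print(judge_output({"acc": 0.90, "f1": 0.80}, {"acc": 0.91, "f1": 0.50}))
    print(judge_method("cat results/table2.csv", ["results/table2.csv"]))
```

Keeping the two checks separate mirrors the distinction drawn above: a script can print the right numbers and still be rejected if it obtains them by copying rather than by rerunning the experiment.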

Paper Structure (43 sections, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of Artisan.
  • Figure 2: Challenges in automated artifact evaluation.
  • Figure 3: Example obfuscation and execution outputs for Table 3 of ScType.
  • Figure 4: Example of the mismatched results feedback for Table 2 of Drosos et al.
  • Figure 5: Examples of three kinds of reproduction methods.
  • ...and 5 more figures