
A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Xianpeng Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian

Abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
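The sketch below illustrates two mechanical pieces of the methodology described above, using assumed names and data shapes rather than the authors' implementation: selecting evaluation pull requests from the future interval (T0, T1] so that pre-T0 repository knowledge cannot contain them, and scoring an agent's output with file-level precision, recall, and F1 against the files touched by the ground-truth pull request.

```python
# Minimal sketch of the temporal PR selection and the file-level F1 metric.
# All class and function names here are illustrative assumptions, not the
# benchmark's actual code.

from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass
class PullRequest:
    number: int
    merged_at: datetime          # merge timestamp from the repository history
    changed_files: Set[str]      # file paths touched by the PR (ground truth)


def select_evaluation_prs(prs: Iterable[PullRequest],
                          t0: datetime,
                          t1: datetime) -> List[PullRequest]:
    """Keep only PRs merged strictly after T0 and no later than T1, so that
    repository knowledge built from pre-T0 artifacts cannot leak them."""
    return [pr for pr in prs if t0 < pr.merged_at <= t1]


def file_level_f1(predicted_files: Set[str],
                  ground_truth_files: Set[str]) -> float:
    """File-level F1 between the files the agent modified and the files
    modified by the historical ground-truth PR."""
    if not predicted_files or not ground_truth_files:
        return 0.0
    true_positives = len(predicted_files & ground_truth_files)
    precision = true_positives / len(predicted_files)
    recall = true_positives / len(ground_truth_files)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical data: one PR merged before T0, one merged inside (T0, T1].
    prs = [
        PullRequest(101, datetime(2024, 1, 10), {"src/cache.cc"}),
        PullRequest(102, datetime(2024, 3, 5), {"src/server.cc", "src/util.h"}),
    ]
    eval_set = select_evaluation_prs(prs,
                                     t0=datetime(2024, 2, 1),
                                     t1=datetime(2024, 6, 1))
    print([pr.number for pr in eval_set])                               # [102]
    print(file_level_f1({"src/server.cc"}, eval_set[0].changed_files))  # 0.666...
```

Under these assumptions, the matched A/B comparison amounts to running the same agent twice on each selected PR-derived task, once with and once without the pre-T0 repository knowledge, and aggregating the file-level scores per condition.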

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Time-consistent benchmark pipeline. A repository is snapshotted at $T_0$ to construct repository knowledge using only pre-$T_0$ information. Pull requests merged during $(T_0,T_1]$ are converted into natural-language tasks. The same SWE agent is then evaluated in matched conditions with and without repository-derived code knowledge, and outputs are compared against the historical ground-truth pull requests.
  • Figure 2: DragonFly F1 by prompt granularity and model.
  • Figure 3: DragonFly metric trajectories across prompt granularities.
  • Figure 4: React F1 by prompt granularity and model.
  • Figure 5: React metric trajectories across prompt granularities.
  • ...and 5 more figures