GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

TL;DR

GitGoodBench is an end-to-end benchmark for evaluating agentic performance on Git. It spans three core VCS tasks: merge-conflict resolution, history rewriting, and history generation from disorganized changes. It provides three datasets drawn from permissively licensed Python, Java, and Kotlin repositories: a 120-sample lite set for rapid prototyping, a 900-sample full evaluation suite, and a 17,469-sample training corpus. A baseline using GPT-4o with custom Git tools achieves a 21.11% solve rate on the lite set, highlighting the difficulty of integrating Git workflows into SE AI agents. The work emphasizes the need for agents to reason about Git artifacts and to use VCS tooling effectively, and it is intended as a foundation for collecting training trajectories and building richer end-to-end SE agent capabilities. The authors also identify limitations and future directions, such as expanding the available tooling, including diagnostic workflows like bisect, and mitigating evaluation biases.

Abstract

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.

Paper Structure

This paper contains 33 sections, 1 equation, 11 figures, 11 tables.

Figures (11)

  • Figure 1: The three Git scenarios supported by GitGoodBench. Each scenario benchmarks a typical Git use-case and unique aspect of version control.
  • Figure 2: Our mcr prompt.
  • Figure 3: Our mcr prompt continued.
  • Figure 4: Our mcr prompt continued.
  • Figure 5: Our ir prompt.
  • ...and 6 more figures