RepairBench: Leaderboard of Frontier Models for Program Repair

André Silva, Martin Monperrus

TL;DR

RepairBench introduces an execution-based leaderboard for AI-driven program repair, focusing on frontier models to enable frequent, standardized evaluation on real-world bugs. It combines Defects4J v2 and GitBug-Java benchmarks with a uniform zero-shot prompt and a strict cost framework, ranking models primarily by Plausible@1 and secondarily by AST Match@1. The results illustrate a cost-versus-performance trade-off among leading frontier models, and the discussion highlights benchmark leakage concerns and mitigation strategies. Overall, RepairBench provides a reusable, longitudinal framework to track progress in automated program repair as frontier models evolve, with publicly available prompts, patches, and code.
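The two-level ranking mentioned above (Plausible@1 first, AST Match@1 second) can be illustrated with a minimal sketch. It assumes that Plausible@1 is the fraction of bugs whose first generated patch compiles and passes the test suite, and that AST Match@1 is the fraction whose first patch has the same AST as the reference fix. The class `FirstPatchOutcome`, the helpers `score` and `rank`, and the model names are hypothetical and are not taken from the RepairBench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstPatchOutcome:
    """Outcome of the first generated patch for one bug (hypothetical record)."""
    plausible: bool  # assumed meaning: patch compiles and passes all tests
    ast_match: bool  # assumed meaning: patch AST equals the reference fix AST


def score(outcomes: list[FirstPatchOutcome]) -> tuple[float, float]:
    """Return (Plausible@1, AST Match@1) as simple fractions over all bugs."""
    n = len(outcomes)
    if n == 0:
        return (0.0, 0.0)
    plausible_at_1 = sum(o.plausible for o in outcomes) / n
    ast_match_at_1 = sum(o.ast_match for o in outcomes) / n
    return (plausible_at_1, ast_match_at_1)


def rank(results: dict[str, list[FirstPatchOutcome]]) -> list[str]:
    """Order models by Plausible@1, breaking ties with AST Match@1."""
    # Tuples compare element by element, so the second metric only
    # matters when the first one is equal.
    return sorted(results, key=lambda model: score(results[model]), reverse=True)


# Illustrative data only; these are not RepairBench results.
results = {
    "model-a": [FirstPatchOutcome(True, False), FirstPatchOutcome(True, False)],
    "model-b": [FirstPatchOutcome(True, True), FirstPatchOutcome(True, False)],
    "model-c": [FirstPatchOutcome(False, False), FirstPatchOutcome(True, True)],
}
leaderboard = rank(results)
```

In this toy data, `model-a` and `model-b` tie on the primary metric, so the secondary metric decides their order.
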

Abstract

AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advances in AI are bound to shift the state of the art in program repair, yet tracking this progress requires frequent and standardized evaluation. We propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: (1) it is execution-based: all patches are compiled and executed against a test suite; (2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models on real-world program repair tasks. We publicly release the evaluation framework of RepairBench. We will update the leaderboard as new frontier models are released.

Paper Structure (15 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Prompt for bug Csv-6 of Defects4J. The test case and runtime information guide frontier models in generating patches.
  • Figure 2: Performance (Plausible@1) as a function of total cost (in USD). The most performant models are also the most expensive ones.