Specification Vibing for Automated Program Repair

Taohong Zhu, Lucas C. Cordeiro, Mustafa A. Mustafa, Youcheng Sun

TL;DR

VibeRepair is a specification-centric approach to automated program repair: it translates buggy code into a structured behavior specification, repairs that specification to realign it with the intended behavior, and then generates patched code from the repaired specification. An on-demand reasoning component with access to program analysis and historical bug–fix evidence improves repair accuracy on hard cases while keeping the patch exploration space small. Evaluations on Defects4J and on real-world bug benchmarks show that VibeRepair consistently outperforms state-of-the-art baselines across multiple LLM backbones, producing more correct fixes from a smaller patch space. The framework generalizes to bugs collected after the LLMs' training period and is open-sourced, enabling future research to build on explicit behavioral intent as a core APR signal.

Abstract

Large language model (LLM)-driven automated program repair (APR) has advanced rapidly, but most methods remain code-centric: they directly rewrite source code and thereby risk hallucinated, behaviorally inconsistent fixes. This limitation suggests the need for an alternative repair paradigm that relies on a representation more accessible to LLMs than raw code, enabling more accurate understanding, analysis, and alignment during repair. To address this gap, we propose VibeRepair, a specification-centric APR technique that treats repair as behavior-specification repair rather than ad-hoc code editing. VibeRepair first translates buggy code into a structured behavior specification that captures the program's intended runtime behavior, then infers and repairs specification misalignments, and finally synthesizes code strictly guided by the corrected behavior specification. An on-demand reasoning component enriches hard cases with program analysis and historical bug-fix evidence while controlling cost. Across Defects4J and real-world benchmarks and multiple LLMs, VibeRepair demonstrates consistently strong repair effectiveness with a significantly smaller patch space. On Defects4J v1.2, VibeRepair correctly repairs 174 bugs, exceeding the strongest state-of-the-art baseline by 28 bugs, which corresponds to a 19% improvement. On Defects4J v2.0, it repairs 178 bugs, outperforming prior approaches by 33 bugs, representing a 23% improvement. Evaluations on real-world benchmarks collected after the training period of selected LLMs further confirm its effectiveness and generalizability. By centering repair on explicit behavioral intent, VibeRepair reframes APR for the era of "vibe" coding: make the behavior sing, and the code will follow.
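
The relative improvements quoted above can be checked from the reported counts. This is a sketch under one assumption: each percentage is measured against the strongest baseline's count. The baseline totals of 146 and 145 are derived here by subtraction and are not stated in the abstract.

```latex
\begin{align*}
\text{Defects4J v1.2:}\quad & \frac{28}{174 - 28} = \frac{28}{146} \approx 19.2\% \\
\text{Defects4J v2.0:}\quad & \frac{33}{178 - 33} = \frac{33}{145} \approx 22.8\%
\end{align*}
```

Both values round to the 19% and 23% figures given in the abstract.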

Paper Structure (24 sections, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Buggy code and failed LLM-generated patch for Cli-20 (Defects4J). Left: flatten unconditionally splits --x=y at '=' and appends both parts, even when --x is not a recognized option, which can break stopAtNonOption handling for illegal options. Right: the LLM patch tweaks string handling but retains the same unconditional split, leaving the root cause unchanged.
  • Figure 2: Specification-centric repair of Cli-20 using VibeRepair. VibeRepair translates the buggy flatten code (1) into an initial behavior specification (2), infers intended behavior (3), and produces a fixed specification used to generate a patch (rather than directly editing the code; 4). When the patch fails validation, the failing test/error message (5) is used to refine the intended behavior (6) and revise the fixed specification (7). A second patch generated from the revised specification (8) passes all tests (red ✗ / green ✓).
  • Figure 3: Overview of VibeRepair. VibeRepair proceeds in three phases: 1) Transformation: an LLM translates buggy code into a structured, flawed behavior specification that follows a predefined template, and combines it with the failing tests to form the repair input. 2) Repair: the LLM corrects specification misalignments to produce a fixed specification, optionally using an on-demand reasoning component that provides additional repair-supporting information. 3) Generation: the LLM synthesizes candidate code from the fixed specification and validates it against the test suite; when validation fails, failing test information is fed back to the repair phase to refine the specification, and repair and generation repeat until all tests pass.
  • Figure 4: Transformation phase: given the buggy code, the model fills a specification template to produce an initial natural-language behavior specification that inherits the defect.
  • Figure 5: Fixing a flawed specification in the repair phase. Given the flawed specification information, which consists of a flawed specification and the corresponding failing test cases, the model infers the intended behavior, identifies the root cause and repair direction, and outputs a revised specification.
  • ...and 3 more figures
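
The three-phase loop described in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function name spec_centric_repair, the four callables, the RepairResult type, and the max_rounds budget are all hypothetical. The toy stubs stand in for LLM calls and a test runner, and only loosely echo the Cli-20 example from Figures 1 and 2. The on-demand reasoning component is left out because its trigger is not described in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepairResult:
    patch: Optional[str]  # None if no candidate passed within the budget
    spec: str             # last specification produced by the repair phase
    rounds: int           # number of repair/generation rounds executed


def spec_centric_repair(
    buggy_code: str,
    failing_tests: List[str],
    transform: Callable[[str], str],
    repair: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
    validate: Callable[[str], List[str]],
    max_rounds: int = 3,  # assumed budget; not taken from the paper
) -> RepairResult:
    # Phase 1 (Transformation): buggy code -> flawed behavior specification.
    spec = transform(buggy_code)
    feedback = list(failing_tests)
    for round_no in range(1, max_rounds + 1):
        # Phase 2 (Repair): fix the specification using failing-test feedback.
        spec = repair(spec, feedback)
        # Phase 3 (Generation): synthesize code from the spec, then validate.
        patch = generate(spec)
        failures = validate(patch)
        if not failures:
            return RepairResult(patch, spec, round_no)
        # Validation failed: feed the failures back into the repair phase.
        feedback = failures
    return RepairResult(None, spec, max_rounds)


# --- Toy stand-ins (hypothetical; real phases would call an LLM and run tests) ---
def toy_transform(code: str) -> str:
    return "split every '--x=y' token at '='"


def toy_repair(spec: str, feedback: List[str]) -> str:
    if any("stopAtNonOption" in f for f in feedback):
        return "split '--x=y' only when '--x' is a recognized option"
    return "split '--x=y' at the first '='"


def toy_generate(spec: str) -> str:
    return f"// patch generated from spec: {spec}"


def toy_validate(patch: str) -> List[str]:
    if "recognized option" in patch:
        return []
    return ["stopAtNonOption handling fails for an illegal option"]


result = spec_centric_repair(
    "buggy flatten code (placeholder)",
    ["initial failing test (placeholder)"],
    toy_transform, toy_repair, toy_generate, toy_validate,
)
print(result.rounds, result.patch)
```

In this toy run the first candidate fails validation and the second passes, mirroring the two-round flow in the Figure 2 caption. Passing the phases in as callables keeps the loop independent of any particular LLM client or test harness.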