VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

Srijan Bansal, Jiao Fangkai, Yilun Zhou, Austin Xu, Shafiq Joty, Semih Yavuz

Abstract

As Large Language Models shift programming toward human-guided "vibe coding", agentic coding tools increasingly rely on models to self-diagnose and repair their own subtle faults, a capability central to autonomous software engineering yet never systematically evaluated. We present VibePass, the first empirical decomposition that jointly evaluates two coupled tasks: Fault-Triggering Test Generation (FT-Test), constructing a discriminative witness that exposes a latent bug, and Fault-Targeted Program Repair (FPR), repairing it under varying diagnostic conditions. VibePass pairs competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs, we find that fault-targeted reasoning does not scale with general coding ability. Models produce syntactically valid test inputs at near-ceiling rates yet collapse on discriminative generation, with fault hypothesis generation, not output validation, as the dominant bottleneck. Test-guided repair reveals a complementary insight: when self-generated tests successfully witness a fault, the resulting repair matches or outperforms repair guided by externally provided tests, but tests that fail to witness the fault actively degrade repair below unguided baselines. Together, these results reframe the challenge of autonomous debugging: the binding bottleneck is not code synthesis or test validity but fault-targeted reasoning, a capability that remains deficient across all frontier models.

Paper Structure (17 sections, 4 figures, 4 tables)


Figures (4)

  • Figure 1: VibePass evaluates LLM performance across roles requiring fault-targeted reasoning, as is typical of practical coding agents. Given a problem description and a buggy solution, the LLM (Judge) first determines whether a bug exists. If a bug is detected, the LLM (Tester) generates a fault-triggering (FT) test, consisting of an input and expected output (FT-Test Bug Discovery). An FT-test is correct if it satisfies three conditions: the input is valid, the buggy solution fails the test, and a silver solution passes it. The FT-test is then used by the LLM (Debugger) to produce a revised solution (Fault-Targeted Program Repair), which must pass the official test suite to be considered a valid fix.
  • Figure 2: Controlled comparison of test feedback mechanisms on valid corner-case intersection. Debugging performance is shown for samples where both external (Task-1) and self-generated (Task-3) tests are valid. Left: Success rates under NoTest (gray), ExtTest (blue), and IntTest (red) across 11 models. Right: Performance difference (IntTest − ExtTest) per model, with red bars favoring self-generated tests and blue bars favoring external tests.
  • Figure 3: Pipeline Stage Correlations and Model Family Performance Across Cumulative Requirements. Using VibePass instances, we evaluate 12 frontier LLMs across a progression of coding-reasoning tasks: input generation (Valid Input), output prediction (Valid-IO), fault-triggering discrimination (FT-Input, FT-IO), and program Repair, with Judge metrics assessing final correctness. [Left] Pearson correlations reveal that fault-triggering metrics (FT-Input vs. FT-IO, $r=0.988$) and their relationship to Judge performance ($r \geq 0.86$) are the strongest predictors of success, while Valid Input alone weakly predicts downstream results ($r=0.311$). This suggests that the ability to generate fault-revealing tests, rather than mere syntactic validity, is more closely aligned with solving complex bugs. [Right] Cumulative success rates show the largest performance drops at the Valid-IO $\to$ FT-Input (14.7 pp) and FT-IO $\to$ Repair (21.2 pp) transitions, identifying these as the primary reasoning bottlenecks. Family-level trends highlight that while OpenAI models maintain the highest stability (54.3%), Google models struggle with fault-triggering tasks and open-source models underperform significantly in repair (12.1%). VibePass maps the full spectrum from basic test generation to advanced debugging, exposing critical gaps in current model capabilities.
  • Figure 4: Example prompt and corresponding model output for input-validity generation.
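
The three correctness conditions for an FT-test stated in the Figure 1 caption (valid input, buggy solution fails, silver solution passes) can be illustrated with a small sketch. This is not the paper's evaluation harness: the names `FTTest`, `is_correct_ft_test`, `is_valid_input`, `buggy_max`, and `silver_max` are hypothetical, and the toy problem (print the maximum of a list of integers) is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FTTest:
    """A candidate fault-triggering test: an input and its expected output."""
    test_input: str
    expected_output: str


def is_correct_ft_test(
    test: FTTest,
    is_valid_input: Callable[[str], bool],
    buggy_solution: Callable[[str], str],
    silver_solution: Callable[[str], str],
) -> bool:
    """Check the three conditions from the Figure 1 caption (illustrative only)."""
    # Condition 1: the input must satisfy the problem's constraints.
    if not is_valid_input(test.test_input):
        return False
    # Condition 2: the buggy solution must fail the test.
    buggy_fails = buggy_solution(test.test_input) != test.expected_output
    # Condition 3: the silver (reference) solution must pass the test.
    silver_passes = silver_solution(test.test_input) == test.expected_output
    return buggy_fails and silver_passes


# --- Hypothetical toy problem: print the maximum of a non-empty integer list ---

def is_valid_input(s: str) -> bool:
    tokens = s.split()
    return bool(tokens) and all(t.lstrip("-").isdigit() for t in tokens)


def buggy_max(s: str) -> str:
    # Latent bug: the running maximum starts at 0, so all-negative inputs fail.
    best = 0
    for value in map(int, s.split()):
        best = max(best, value)
    return str(best)


def silver_max(s: str) -> str:
    return str(max(map(int, s.split())))


witness = FTTest("-5 -2", "-2")        # exposes the bug
non_witness = FTTest("1 2", "2")       # valid, but the buggy solution passes it
wrong_output = FTTest("-5 -2", "-5")   # expected output disagrees with silver

print(is_correct_ft_test(witness, is_valid_input, buggy_max, silver_max))
print(is_correct_ft_test(non_witness, is_valid_input, buggy_max, silver_max))
print(is_correct_ft_test(wrong_output, is_valid_input, buggy_max, silver_max))
```

In this sketch only `witness` satisfies all three conditions. `non_witness` shows a test that is valid but not discriminative, and `wrong_output` shows a fault-triggering input paired with an incorrect expected output, which loosely mirrors the distinction the captions draw between the FT-Input and FT-IO stages.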