AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests
Lara Khatib, Noble Saji Mathews, Meiyappan Nagappan
TL;DR
AssertFlip reframes bug reproduction: instead of asking an LLM to write a failing test directly, it first generates passing tests that exercise the buggy behavior and then inverts them into bug-revealing tests. This targets a key bottleneck, namely that most bugs lack an executable reproduction test at report time. The approach is a structured, LLM-guided pipeline (localization, planning, pass-test generation, refinement, inversion, validation, and regeneration) and uses a get_info tool to reduce code hallucinations. On SWT-Bench, AssertFlip achieves the best reported fail-to-pass rate on the Verified subset (43.6%), outperforming prior methods, with strong coverage of modified lines and competitive cost under practical regeneration limits. The work highlights the value of objective-driven generation (writing passing tests first) and suggests potential gains from combining diverse systems and from integrating coverage signals into validation, offering a practical path toward more reliable automated bug reproduction in real-world debugging workflows.
Abstract
Bug reproduction is critical in the software debugging and repair process, yet the majority of bugs in open-source and industrial settings lack executable tests to reproduce them at the time they are reported, making diagnosis and resolution more difficult and time-consuming. To address this challenge, we introduce AssertFlip, a novel technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs). Unlike existing methods that attempt to generate failing tests directly, AssertFlip first generates tests that pass on the buggy behaviour and then inverts them so that they fail when the bug is present. We hypothesize that LLMs are better at writing passing tests than tests that crash or fail on purpose. Our results show that AssertFlip outperforms all known techniques on the leaderboard of SWT-Bench, a benchmark curated for BRTs. Specifically, AssertFlip achieves a fail-to-pass success rate of 43.6% on the SWT-Bench-Verified subset.
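To make the pass-then-invert idea concrete, the following is a minimal sketch rather than the AssertFlip implementation. All names in it (buggy_slugify, fixed_slugify, passing_test, inverted_test, is_fail_to_pass) are hypothetical and chosen only for illustration, and the toy bug is invented. It shows a test that passes by asserting the currently observed buggy output, the same test with its assertion flipped to the expected output, and a fail-to-pass check, assumed here to mean that a test fails on the buggy code and passes on the fixed code.

```python
from typing import Callable


def buggy_slugify(title: str) -> str:
    """Hypothetical buggy code: repeated spaces produce repeated dashes."""
    return title.strip().lower().replace(" ", "-")


def fixed_slugify(title: str) -> str:
    """Hypothetical patched code: any run of whitespace becomes one dash."""
    return "-".join(title.lower().split())


def passing_test(slugify: Callable[[str], str]) -> None:
    # Step 1: assert the buggy output that is observed today,
    # so the test passes on the unpatched code.
    assert slugify("Hello  World") == "hello--world"


def inverted_test(slugify: Callable[[str], str]) -> None:
    # Step 2: flip the assertion to the behaviour the issue expects,
    # so the test now fails while the bug is present.
    assert slugify("Hello  World") == "hello-world"


def passes(test: Callable, impl: Callable[[str], str]) -> bool:
    """Run a test against an implementation and report whether it passed."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def is_fail_to_pass(test: Callable) -> bool:
    """Assumed criterion: fails on buggy code, passes on fixed code."""
    return (not passes(test, buggy_slugify)) and passes(test, fixed_slugify)


if __name__ == "__main__":
    print("passing test is fail-to-pass:", is_fail_to_pass(passing_test))
    print("inverted test is fail-to-pass:", is_fail_to_pass(inverted_test))
```

In this sketch only the inverted test satisfies the fail-to-pass check; the original passing test does the opposite, since it encodes the buggy behaviour as correct.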
