Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect
Yujing Wang, Weize Hong
TL;DR
The paper tackles semantic fault localization in Git bisect under real-world noise by integrating LLMs into the commit-by-commit traversal and using structured chain-of-thought prompts to justify each decision. A two-stage fine-tuning workflow with QLoRA (global semantic fine-tuning followed by repository-specific fine-tuning), combined with weak supervision, enables robust commit-level reasoning over diffs. The reported bisect success rate rises from $74.2\%$ to $80.6\%$, with up to a 2x reduction in average bisect time. The study includes model selection benchmarks, a formal prompting framework with a JSON output schema, and a 32-run end-to-end evaluation across three repositories, with statistically significant results ($p < 0.01$) and a developer-time study showing substantial productivity gains. The work demonstrates practical improvements for automated debugging pipelines in forked/downstream development contexts and outlines future directions for domain-adaptive prompting and larger-scale dataset curation.
Abstract
We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Traditional bisect assumes deterministic predicates and binary failure states, assumptions that are often violated in modern software development due to flaky tests, non-monotonic regressions, and semantic divergence from upstream repositories. Our system augments bisect traversal with structured chain-of-thought reasoning, enabling commit-by-commit analysis under noisy conditions. We evaluate multiple open-source and proprietary LLMs for their suitability and fine-tune DeepSeek-Coder-V2 using QLoRA on a curated dataset of semantically labeled diffs. We adopt a weak supervision workflow to reduce annotation overhead, incorporating human-in-the-loop corrections and self-consistency filtering. Experiments across multiple open-source projects show a 6.4-point absolute gain in success rate, from 74.2% to 80.6%, leading to significantly fewer failed traversals and, in our experiments, up to a 2x reduction in average bisect time. We conclude with a discussion of temporal reasoning, prompt design, and fine-tuning strategies tailored to commit-level behavior analysis.
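To make the assumption about deterministic predicates concrete, the following is a minimal sketch of the classic bisect search, not the system described in this paper. The names `bisect_first_bad`, `is_bad`, and `is_bad_flaky`, as well as the 16-commit toy history, are hypothetical and chosen only for illustration. The sketch shows that a single wrong answer from the predicate (for example, a flaky test) is enough to make plain bisect report the wrong commit, which is the failure mode an LLM-assisted judge is meant to mitigate.

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (first bad commit, number of predicate calls).

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    predicate is deterministic and monotonic (good ... good bad ... bad).
    """
    lo, hi = 0, len(commits) - 1  # invariant: lo is good, hi is bad
    calls = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        calls += 1
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], calls


# Toy history (hypothetical): c0..c15, regression introduced at c11.
commits = [f"c{i}" for i in range(16)]


def is_bad(commit: str) -> bool:
    """Deterministic predicate: every commit from c11 onward is bad."""
    return int(commit[1:]) >= 11


def is_bad_flaky(commit: str) -> bool:
    """Same predicate, except it misreports c11 as good (simulated flaky test)."""
    return int(commit[1:]) >= 11 and commit != "c11"


clean_result = bisect_first_bad(commits, is_bad)
flaky_result = bisect_first_bad(commits, is_bad_flaky)
print(clean_result)  # correct culprit under a reliable predicate
print(flaky_result)  # one wrong answer shifts the reported culprit
```

With a reliable predicate the search needs only about $\log_2 n$ evaluations, but it has no way to recover from a single incorrect answer, since each decision permanently discards half of the remaining range.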
