Theoretical analysis of git bisect

Julien Courtiel, Paul Dorbec, Romain Lecoq

TL;DR

This paper studies git bisect, the algorithm git uses to find a regression in a version control system (VCS), and proves that, in a general setting, git bisect can use exponentially more queries than an optimal algorithm.

Abstract

In this paper, we consider the problem of finding a regression in a version control system (VCS), such as git. The set of versions is modelled by a Directed Acyclic Graph (DAG) where vertices represent versions of the software, and arcs are the changes between different versions. We assume that somewhere in the DAG, a bug was introduced, which persists in all of its subsequent versions. It is possible to query a vertex to check whether the corresponding version carries the bug. Given a DAG and a bugged vertex, the Regression Search Problem consists in finding the first vertex containing the bug in a minimum number of queries in the worst-case scenario. This problem is known to be NP-complete. We study the algorithm used in git to address this problem, known as git bisect. We prove that in a general setting, git bisect can use an exponentially larger number of queries than an optimal algorithm. We also consider the restriction where all vertices have indegree at most 2 (i.e. where merges are made between at most two branches at a time in the VCS), and prove that in this case, git bisect is a $\frac{1}{\log_2(3/2)}$-approximation algorithm, and that this bound is tight. We also provide a better approximation algorithm for this case. Finally, we give an alternative proof of the NP-completeness of the Regression Search Problem, via a variation with bounded indegree.

Paper Structure

This paper contains 18 sections, 20 theorems, 24 equations, 21 figures, 1 table, 2 algorithms.

Key Result

Proposition 2

For any DAG $D$ where the marked bugged vertex has $n$ ancestors, an optimal strategy that finds the faulty commit uses at least $\lceil \log_{2}(n)\rceil$ queries, and at most $n-1$ queries.
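The lower bound in Proposition 2 is met on a directed path, where plain binary search applies. The following is a minimal sketch under that assumption, not code from the paper: the function names `queries_on_path` and `worst_case_on_path` are hypothetical, vertices are numbered 0 to n-1 in topological order, and vertex n-1 plays the role of the marked bugged vertex with $n$ ancestors (itself included).

```python
def queries_on_path(n, faulty):
    """Count the queries binary search makes on a directed path of n vertices.

    Vertices are 0..n-1 in topological order. Vertex n-1 is the marked
    bugged vertex; `faulty` is the hidden first bugged vertex.
    """
    lo, hi = 0, n - 1              # invariant: the faulty commit lies in lo..hi
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1               # one query: "is vertex mid bugged?"
        if mid >= faulty:          # bugged: the faulty commit is mid or earlier
            hi = mid
        else:                      # clean: the faulty commit comes after mid
            lo = mid + 1
    assert lo == faulty            # the search always ends on the faulty commit
    return queries


def worst_case_on_path(n):
    """Worst-case number of queries over every possible faulty commit."""
    return max(queries_on_path(n, f) for f in range(n))
```

On the path on $5$ vertices, `worst_case_on_path(5)` evaluates to 3, which equals $\lceil \log_{2}(5)\rceil$.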

Figures (21)

  • Figure 1: An example of a DAG. The bugged vertices are coloured. The strikeout vertex ($\boldsymbol{21}$) is the marked vertex, known to be bugged. The crossed vertex ($\boldsymbol{5}$) is the faulty commit.
  • Figure 2: Left. A directed path on $5$ vertices. Right. A possible strategy for the Regression Search Problem on the path on $5$ vertices.
  • Figure 3: An octopus of size $6$.
  • Figure 4: The notation $a/b$ along each vertex indicates that $a$ is the number of ancestors of the vertex, and $b$ is the number of non-ancestors. The score (see Definition 3) is displayed in black.
  • Figure 5: The git bisect strategy corresponding to the graph of Figure 4. When several vertices have the same score, we adopt the convention of querying the vertex with the smallest label.
  • ...and 16 more figures
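The captions of Figures 4 and 5 suggest how git bisect chooses its next query. The sketch below makes one assumption that this summary does not state explicitly: the score of a vertex is taken to be the minimum of its number of ancestors and its number of non-ancestors, counted among the ancestors of the marked vertex, with a vertex counted as its own ancestor. The names `ancestors` and `bisect_choice`, and the `parents` dictionary encoding of the DAG, are hypothetical.

```python
def ancestors(parents, v):
    """Return the set of ancestors of v, with v itself included."""
    seen = set()
    stack = [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(parents.get(u, ()))
    return seen


def bisect_choice(parents, marked):
    """Pick the vertex to query next, under the assumed score definition.

    `parents` maps each vertex to the list of its direct parents.
    Only ancestors of the marked bugged vertex are candidates.
    """
    candidates = ancestors(parents, marked)
    n = len(candidates)

    def score(v):
        a = len(ancestors(parents, v))   # ancestors of v
        return min(a, n - a)             # assumed score: min(ancestors, non-ancestors)

    # Highest score first; ties go to the smallest label (Figure 5 convention).
    return min(candidates, key=lambda v: (-score(v), v))


# Directed path 1 -> 2 -> 3 -> 4 -> 5, with vertex 5 marked as bugged.
path = {2: [1], 3: [2], 4: [3], 5: [4]}
print(bisect_choice(path, 5))
```

On this path, vertices 2 and 3 both have score 2 (ratios $2/3$ and $3/2$ in the $a/b$ notation), so the smallest-label convention selects vertex 2.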

Theorems & Definitions (53)

  • Definition 1
  • Proposition 2
  • proof
  • Definition 3: Score
  • Definition 4: Comb addition
  • Theorem 5
  • proof : Detailed proof
  • Claim 5.1
  • Claim 5.2
  • Claim 5.3
  • ...and 43 more