Table of Contents
Fetching ...

Phase transition in the computational complexity of the shortest common superstring and genome assembly

L. A. Fernandez, V. Martin-Mayor, D. Yllanes

TL;DR

This work demonstrates the existence of a phase transition in the computational complexity of the problem and shows that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms), and proposes a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

Abstract

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the 'easy' phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

Phase transition in the computational complexity of the shortest common superstring and genome assembly

TL;DR

This work demonstrates the existence of a phase transition in the computational complexity of the problem and shows that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms), and proposes a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

Abstract

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the 'easy' phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
Paper Structure (9 sections, 6 equations, 5 figures)

This paper contains 9 sections, 6 equations, 5 figures.

Figures (5)

  • Figure 1: Performance of common algorithms for the shortest common superstring problem.Top: Probability of finding a successful solution (see text) using a greedy algorithm as a function of the coverage $W$, eq. (\ref{['eq:W']}), for several values of the number of fragments (reads) $N_{\text{frag}}$ and for fragment length $\ell_{\text{frag}}\xspace=100$. For large coverage values, the algorithm always succeeds. Bottom: In terms of the correct scaling variable $x$, eq. (\ref{['eq:x']}), based on the ratio between the average maximum distance between fragments and $\ell_{\text{frag}}$, the $p_{\text{success}}$ curves for different $N_{\text{frag}}$ cross, which we interpret as the onset of a phase transition at some critical $x_{\text{c}}$. The value of $x_{\text{c}}$ is algorithm dependent, but the qualitative behaviour is the same for more sophisticated methods. As a demonstration, we also show the results using Velvet, which employs an algorithm based on de Bruijn graphs.
  • Figure 2: Location of the critical point. The critical point can be determined by looking for the value of $\langle x\rangle$ where fluctuations are largest. We plot the correlation coefficient between the scaling variable $x$ and the success probability for single realisations of the $N_{\text{frag}}$ reads. The absolute value of $r$ has a maximum at the critical point $x_{\text{c}}$.
  • Figure 3: The success probability goes to zero in the hard phase. If the behaviour shown in Fig. \ref{['fig:gloton-velvet']} corresponds to a phase transition, $p_{\text{success}}$ should tend to zero as $N_{\text{frag}}\xspace\to\infty$ for $\langle x\rangle < x_{\text{c}}\xspace$. This figure shows that indeed $p_{\text{success}}$ decays at least as fast as power law in $1/N_{\text{frag}}\xspace$ in the hard phase, while in the easy phase our results are already compatible with $p_{\text{success}}\xspace=1$ for finite sizes.
  • Figure 4: Varying the fragment length hardly makes any difference. Our previous results have always considered $\ell_{\text{frag}}\xspace=100$. It turns out that the dependence in this parameter is residual and rapidly vanishes as $\ell_{\text{frag}}$ grows, according to eq. (\ref{['eq:lf']}). That is, the curves of $p_{\text{success}}$ as a function of $\langle x\rangle$ can be collapsed if we subtract the scaling correction caused by finite $\ell_{\text{frag}}$. We also show that the results for a natural genome (namely that of the swinepox virus) are indistinguishable from those for random sequences. In this case, since $L$ is fixed, we have a single value of $p_{\text{success}}$ for each $\ell_{\text{frag}}$, all of which fall on the rescaled curve.
  • Figure 5: A Monte Carlo algorithm for the hard cases. We propose a segment-swap Markov-chain Monte Carlo algorithm (sketched in the left panel) that outperforms common deterministic methods in the hard regime (see right panel). We represent the permutation of the reads as an ordered sequence of fragments (the arrows indicate the sense in which the sequence should be toured). The elementary move of the algorithm is composed of the following three steps. First, chose randomly three independent pairs of consecutive fragments (depicted with grey circles in the plot), $\ldots \to \alpha_1 \to \alpha_2\to\ldots\to \beta_1 \to \beta_2\to\ldots\to \gamma_1 \to \gamma_2\to\ldots$. Mind the pair ordering: when one tours the circular sequence starting from fragment $\alpha_1$, fragments $\gamma_1$ and $\gamma_2$ are not found earlier than $\beta_1$ and $\beta_2$ (the choices $\alpha_2=\beta_1$ and/or $\beta_2=\gamma_1$ are acceptable). Second, consider the rewired sequence $\ldots \to \alpha_1 \to \beta_2\to\ldots\to \gamma_1 \to \alpha_2\to\ldots\to \beta_1 \to \gamma_2\to\ldots$ (there is an unacceptable reconnection ---indicated by NO in the figure--- that would split the sequence into three disconnected cycles). Third step: If the new cycle is not longer than the original one, the segment swap is accepted. As we show in the right panel, segment swap is more effective than other algorithms for $-1 < x < 0.5$. For $x<0$ the SCS problem no longer corresponds to a full assembly, since there are gaps between reads. The segment-swap algorithm, however, always finds superstrings that satisfy our success criterion ($\ell \leq \ell_\text{ordered}$).