Pangenome-guided sequence assembly via binary optimisation

Josh Cudby, James Bonfield, Chenxi Zhou, Richard Durbin, Sergii Strelchuk

TL;DR

This work introduces pangenome-guided sequence assembly, casting graph traversal through a pangenome as a QUBO-based optimisation problem to resolve short-read data in complex regions without reference bias. The authors provide a five-stage workflow (pangenome creation, read mapping, copy-number annotation, path optimisation, solution processing) and explore classical and quantum solvers, including quantum annealing and QAOA, to solve the Tangle Resolution formulations. Empirical results on simulated data show substantially fewer contigs than de novo assemblers, with a modest trade-off in accuracy that can be mitigated by optional realignment or consensus steps; classical solvers are competitive with state-of-the-art methods, while early quantum demonstrations indicate potential as hardware scales. The work also introduces tools for realistic synthetic pangenomes and read-to-graph mapping (kmer2node), contributing broadly to pangenomics tooling and motivating further quantum-ready algorithm development for genome assembly.

Abstract

De novo genome assembly is challenging in highly repetitive regions, while reference-guided assemblers often suffer from bias. We propose a framework for pangenome-guided sequence assembly, which can resolve short-read data in complex regions without bias towards a single reference genome. Our primary contribution is to frame the assembly as a graph traversal optimisation problem, which can be implemented classically or on a quantum computer. The workflow involves first annotating pangenome graphs with estimated copy numbers for each node, then finding a path on the graph that best explains those copy numbers. On simulated data, our approach significantly reduces the number of contigs compared to de novo assemblers. While our optimisation-based methods introduce a small increase in inaccuracies, such as false joins, they are competitive with current exhaustive search techniques. They are also designed to scale more efficiently as the problem size grows and to run effectively on future quantum computers; a small experiment on a real quantum device showcases this behaviour. Moreover, they are more resilient to the noise in copy number estimation that is inherent in short-read-based assembly. We also develop novel tools for creating realistic synthetic pangenomes, aligning reads to pangenomes, and evaluating assembly quality.
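
To make the idea of "finding a path on the graph that best explains those copy numbers" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's Tangle Resolution formulation: the time-indexed binary variables `x[t][v]`, the penalty weight, the three-node toy graph and the brute-force search are all hypothetical choices made for this example. Every term is at most quadratic in the binary variables, so the same cost could in principle be expanded into a QUBO matrix and handed to a QUBO solver.

```python
from itertools import product


def energy(x, edges, copy_numbers, penalty=4.0):
    """Cost of a binary assignment x[t][v] (1 if the walk sits on node v at step t).

    Hypothetical encoding chosen for illustration only.
    """
    steps, n = len(x), len(copy_numbers)
    # Favour visiting each node as often as its estimated copy number.
    copy_term = sum(
        (sum(x[t][v] for t in range(steps)) - copy_numbers[v]) ** 2
        for v in range(n)
    )
    # Penalise steps that select zero nodes or several nodes at once.
    one_hot = sum((sum(x[t]) - 1) ** 2 for t in range(steps))
    # Penalise consecutive nodes that are not joined by a graph edge.
    non_edge = sum(
        x[t][u] * x[t + 1][v]
        for t in range(steps - 1)
        for u in range(n)
        for v in range(n)
        if (u, v) not in edges
    )
    return copy_term + penalty * (one_hot + non_edge)


def brute_force(edges, copy_numbers, steps):
    """Exhaustively minimise the energy; only feasible for tiny instances."""
    n = len(copy_numbers)
    best_energy, best_x = None, None
    for bits in product((0, 1), repeat=steps * n):
        x = [bits[t * n:(t + 1) * n] for t in range(steps)]
        e = energy(x, edges, copy_numbers)
        if best_energy is None or e < best_energy:
            best_energy, best_x = e, x
    return best_energy, best_x


def decode(x):
    """Read the visited node at each step from a one-hot assignment."""
    return [row.index(1) for row in x]


# Toy graph: node 1 carries a self-loop (a tandem repeat) and copy number 2.
edges = {(0, 1), (1, 1), (1, 2)}
copy_numbers = [1, 2, 1]
best_energy, best_x = brute_force(edges, copy_numbers, steps=4)
print(best_energy, decode(best_x))  # 0.0 [0, 1, 1, 2]
```

The brute-force loop stands in for whichever solver is used; it enumerates all 2^12 assignments here, which is exactly the growth that motivates heuristic, annealing or QAOA-style solvers on larger instances.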

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: A sketch of the pangenome-guided sequence assembly procedure. (1) Problem creation consists of two steps: a pangenome and a new individual genome are synthesised, and the genome is shredded to simulate shotgun sequencing. (2) Read mapping involves aligning the short reads to the pangenome using one of three software tools and tagging the nodes with the observed kmer counts. (3) Copy number estimation is performed using pathfinder. (4) Optimal path finding depends on the choice of solver. If using pathfinder, the annotated graph is input directly. Otherwise, the QUBO matrix is constructed from the graph and passed to the chosen QUBO solver. (5) Solution processing starts with an optional re-alignment step to improve solution quality, followed by evaluation of the solution with a variety of metrics.
  • Figure 2: (a), (b) Radar charts comparing the performance of each combination of annotation strategy (GraphAligner, kmer2node or minigraph) and classical solver (Gurobi, MQLib or pathfinder). The seven axes plotted are the evaluation criteria discussed in Section \ref{sec:classical_post}. For each axis, the number in brackets corresponds to the outermost point on that axis. For the lower half of each plot, points further out along the axis correspond to fewer contigs, breaks and so on. (c) Violin plots showing the per-instance performance of Gurobi, MQLib and pathfinder.
  • Figure 3: (a), (b) Radar charts comparing the performance of each combination of annotation strategy (GraphAligner, kmer2node or minigraph) with either the quantum annealing solver D-Wave or the classical solver pathfinder. The seven axes plotted are the evaluation criteria discussed in Section \ref{sec:classical_post}. For each axis, the number in brackets corresponds to the outermost point on that axis. For the lower half of each plot, points further out along the axis correspond to fewer contigs, breaks and so on. (c) Violin plots showing the per-instance performance of D-Wave and pathfinder.
  • Figure 4: Plots showing the performance of a 10-qubit QAOA simulation for Oriented Tangle Resolution.
  • Figure 5: Plots showing the performance of a 10-qubit QAOA simulation for Oriented Tangle Resolution.
  • ...and 2 more figures
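
The shredding, kmer tagging and copy number steps named in the Figure 1 caption can be illustrated with a toy Python sketch. This is not how kmer2node or pathfinder work internally; it assumes error-free, forward-strand reads tiled at a fixed stride, exact kmer matching, and a known per-kmer depth, and the helper names (`shred`, `kmers`, `estimate_copy_numbers`) and node sequences are invented for this example.

```python
from collections import Counter
from statistics import median


def shred(genome, read_len, stride=1):
    """Cut a genome into overlapping fixed-length reads (idealised shotgun reads)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, stride)]


def kmers(seq, k):
    """All kmers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def estimate_copy_numbers(nodes, reads, k, depth):
    """Tag each node with its median kmer count, then scale by the expected depth."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {
        name: round(median(counts[km] for km in kmers(seq, k)) / depth)
        for name, seq in nodes.items()
    }


# Hypothetical graph nodes: two unique flanks around a repeat unit.
nodes = {"left": "ACAACCCACACC", "rep": "GGTGTTGT", "right": "AGAAGGGAGAGG"}
# The sampled individual carries three copies of the repeat.
genome = nodes["left"] + nodes["rep"] * 3 + nodes["right"]

read_len, k = 8, 4
reads = shred(genome, read_len)
# With stride 1, an interior kmer is covered by read_len - k + 1 reads.
depth = read_len - k + 1
estimate = estimate_copy_numbers(nodes, reads, k, depth)
print(estimate)  # {'left': 1, 'rep': 3, 'right': 1}
```

The median is used rather than the mean because kmers near the ends of the genome are covered by fewer reads; with real, noisy short reads the estimates would be far less clean, which is the situation the optimisation step has to tolerate.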