From de Bruijn graphs to variation graphs - relationships between pangenome models

Adam Cicherski, Norbert Dojer

TL;DR

An axiomatization of the desirable properties of a graph representation of a collection of strings is proposed and the relationship between variation graphs satisfying these criteria and de Bruijn graphs is shown.

Abstract

Pangenomes serve as a framework for joint analysis of genomes of related organisms. Several pangenome models have been proposed; they differ in functionality, in the applications supported by available tools, in efficiency, and so on. Among them, two graph-based models are particularly widely used: variation graphs and de Bruijn graphs. In this paper we propose an axiomatization of the desirable properties of a graph representation of a collection of strings, and we show the relationship between variation graphs satisfying these criteria and de Bruijn graphs. This relationship can be used to efficiently build a variation graph representing a given set of genomes, to transfer annotations between the two models, to compare the results of analyses based on each model, and so on.

Paper Structure

This paper contains 14 sections, 8 theorems, 1 equation, 3 figures.

Key Result

Proposition

Given a set of strings $\mathcal{S}=\{S_1,\ldots,S_n\}$ such that $|S_i|\ge k$ for every $i\in\{1,\ldots,n\}$, there is a unique (up to isomorphism) representation of $\mathcal{S}$ by a de Bruijn graph of length $k$. Moreover, this representation is $k$-complete and $k$-faithful.
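The construction behind this representation can be sketched in a few lines of Python. This is a minimal illustration, not the paper's formal definition: it assumes that vertices are the distinct $k$-mers of the input, that an edge joins two $k$-mers occurring consecutively in some string, and that each string is represented by the path of its consecutive $k$-mers. The names `de_bruijn_representation` and `spell` are hypothetical helpers introduced here.

```python
def de_bruijn_representation(strings, k):
    """Build a de Bruijn graph of length k together with one path per string.

    Assumption (not taken verbatim from the paper): vertices are distinct
    k-mers, edges join k-mers that are adjacent in some input string.
    """
    if any(len(s) < k for s in strings):
        raise ValueError("every string must have length at least k")

    vertices = set()
    edges = set()
    paths = []
    for s in strings:
        # Consecutive k-mers of s, in order of occurrence.
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        vertices.update(kmers)
        edges.update(zip(kmers, kmers[1:]))
        paths.append(kmers)
    return vertices, edges, paths


def spell(path):
    """Recover the string spelled by a path of overlapping k-mers."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])
```

Because the vertex set is determined by the $k$-mers present in the input, two runs on the same set of strings can differ only in naming, which is consistent with the uniqueness up to isomorphism stated above. For the strings of Figure 1 with $k=3$, this sketch yields seven vertices and six edges, and `spell` applied to each path returns the original string.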

Figures (3)

  • Figure 1: Two examples of $3$-complete and $3$-faithful representations of the set of strings $\{GTGT, TTGT, ATGG, ATGA, CTGG, CTGA \}$: a de Bruijn graph on the left and a variation graph on the right. Different edge colors correspond to different paths in $\pi$. Note that in the variation graph the common $2$-mer $TG$ from sequences $ATGG$ (red), $ATGA$ (blue), $CTGG$ (black) and $CTGA$ (purple) can be represented by a common vertex, because its occurrences on the blue and red paths can be extended to a common path labeled by the $3$-mer $ATG$, the occurrences on the red and black paths can be extended to a path labeled by $TGG$, and the occurrences on the black and purple paths can be extended to a path labeled by $CTG$. On the other hand, the occurrences of the $2$-mer $TG$ in sequences $GTGT$ (yellow) and $TTGT$ (pink) cannot be represented by this vertex, since they cannot be extended to a path labeled by a $3$-mer shared with any occurrence on the remaining four paths.
  • Figure 2: Steps of the graph transformation algorithm for the graph representation of the set of strings $S= \{ACTGA, ACTGT, ACTT, CCTT, CCTA\}$. From top: input de Bruijn graph for $k=3$ and transition graphs resulting from Split, Merge and Collapse transformations, respectively. Solid lines represent V-edges, dashed lines represent B-edges.
  • Figure 3: Example of Collapse transformation applied to a $B$-edge, in which $k-1$ vertices following this edge overlap with $k-1$ vertices preceding it. $S=\{ACTGACTGACT\}$, $k=7$, labels above edges indicate their order in the path. Top: $B$-edge 10 is followed by $k-1$ vertices connected by edges 11-15 and preceded by $k-1$ vertices connected by edges 5-9, the overlap forms a subpath with vertices $A$ and $C$ connected by edge labeled with 5 before the $B$-edge and 15 after the $B$-edge. Bottom: after the Collapse operation the overlapping subpath is merged with two other subpaths with vertices $A$ and $C$: preceding the $B$-edge and following it.
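The grouping of $2$-mer occurrences described in the caption of Figure 1 can be reproduced with a small union-find sketch. This is an illustration only and not the Split, Merge and Collapse algorithm of Figure 2: it assumes that two character occurrences may share a vertex when they are aligned inside occurrences of the same $k$-mer, and it closes that relation transitively. The function name `merge_classes` and the `(string index, offset)` encoding of positions are hypothetical choices made here.

```python
from collections import defaultdict


def merge_classes(strings, k):
    """Group character positions that are aligned within a shared k-mer.

    Returns a dict mapping each position (string index, offset) to a
    representative position of its class.
    """
    parent = {(i, j): (i, j)
              for i, s in enumerate(strings)
              for j in range(len(s))}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Record every occurrence (string index, start offset) of every k-mer.
    occurrences = defaultdict(list)
    for i, s in enumerate(strings):
        for j in range(len(s) - k + 1):
            occurrences[s[j:j + k]].append((i, j))

    # Align all occurrences of the same k-mer character by character.
    for occ in occurrences.values():
        i0, j0 = occ[0]
        for i, j in occ[1:]:
            for d in range(k):
                union((i0, j0 + d), (i, j + d))

    return {pos: find(pos) for pos in parent}
```

On the strings of Figure 1 with $k=3$, the $T$ and $G$ of $TG$ in $ATGG$, $ATGA$, $CTGG$ and $CTGA$ end up in the same classes (chained through $ATG$, $TGG$, $CTG$ and $TGA$), while the occurrences in $GTGT$ and $TTGT$ are grouped only with each other through $TGT$.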

Theorems & Definitions (15)

  • Proposition
  • Proof
  • Lemma
  • Proof
  • Theorem
  • Proof
  • Lemma
  • Proof
  • Lemma
  • Proof
  • ...and 5 more