Variations on the Problem of Identifying Spectrum-Preserving String Sets

Sankardeep Chakraborty, Roberto Grossi, Ren Kimura, Giulia Punzi, Kunihiko Sadakane, Wiktor Zuba

TL;DR

Experiments indicate that the minimum necklace cover achieves smaller representations than Eulertigs and comparable compression to the Masked Superstrings approach, while maintaining exactness of the $k$-mer spectrum.

Abstract

In computational genomics, many analyses rely on efficient storage and traversal of $k$-mers, motivating compact representations such as spectrum-preserving string sets (SPSS), which store strings whose $k$-mer spectrum matches that of the input. Existing approaches, including Unitigs, Eulertigs, and Matchtigs, model this task as a path cover problem on the de Bruijn graph. We extend this framework from paths to branching structures by introducing necklace covers, which combine cycles and tree-like attachments (pendants). We present a greedy algorithm that constructs a necklace cover while guaranteeing, under certain conditions, optimality in the cumulative size of the final representation. Experiments on real genomic datasets indicate that the minimum necklace cover achieves smaller representations than Eulertigs and comparable compression to the Masked Superstrings approach, while maintaining exactness of the $k$-mer spectrum.
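
To make the spectrum-preservation requirement concrete, the following minimal Python sketch computes the $k$-mer spectrum of a string set and checks whether a candidate set preserves it. The helper names (`kmer_spectrum`, `is_spectrum_preserving`, `cumulative_length`) and the toy input are illustrative assumptions, not taken from the paper, and reverse complements are ignored for simplicity.

```python
def kmer_spectrum(strings, k):
    """Return the set of distinct k-mers occurring in any of the strings."""
    # Strings shorter than k contribute nothing (the range is empty).
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spectrum_preserving(candidate, original, k):
    """True if candidate has exactly the same k-mer spectrum as original."""
    return kmer_spectrum(candidate, k) == kmer_spectrum(original, k)


def cumulative_length(strings):
    """Total number of characters stored: a simple size measure."""
    return sum(len(s) for s in strings)


# Hypothetical toy input (not from the paper).
original = ["ACGTA", "CGTAC", "GTACG"]
candidate = ["ACGTACG"]  # a single string covering the same 3-mers

if __name__ == "__main__":
    print(is_spectrum_preserving(candidate, original, 3))
    print(cumulative_length(original), cumulative_length(candidate))
```

In this toy case the candidate stores the same 3-mers in fewer characters, which is the kind of saving that SPSS representations aim for.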

Paper Structure (10 sections, 6 theorems, 6 figures, 1 table, 3 algorithms)

This paper contains 10 sections, 6 theorems, 6 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

The parenthesis representation for a necklace cover of an input set $I$ can be computed in $O(w(I))$ time and space. Let $N_k$ be the number of distinct $k$-mers in $I$. Let $N_C$ be the number of closed necklaces, $N_O$ be the number of open necklaces, and $N_L$ be the number of leaves over all the ...

Figures (6)

  • Figure 1: Node-centric (left) and edge-centric (right) de Bruijn graphs for input string set $I = \{\texttt{TGGACGGGACGGCAT}, \texttt{CAGTTCC}, \texttt{CGGTCGTT}, \texttt{GGCAGCT}\}$ and $k=3$. On the left, nodes correspond to $k$-mers, and edges connect $k$-mers that have an overlap of $k-1$ (edge labels are omitted). On the right, the nodes are the $(k-1)$-mers of $I$, and $k$-mers are given by edges: edge $(u,v,c)$ represents $k$-mer $uc$. Note that the number of nodes of the graph on the left is equal to the number of edges of the graph on the right (both equal to 21, the number of distinct $k$-mers of $I$).
  • Figure 2: Closed (left) and open (right) necklaces, with 4 and 3 pendants, respectively.
  • Figure 3: Bottom: open necklace with the same pendant structure as the subtrees of the root on the top, with parenthesis representation given by $\texttt{ACGT(T(C(G)C)A)A(CT)TA(A(CC)(G)T)G}$.
  • Figure 4: Necklace cover for the graph of Figure 1, formed by two closed necklaces (green and blue) and one open necklace (red).
  • Figure 5: Example where FindNewCycle must be executed on the paths to the right: the two paths corresponding to $\texttt{ATCAC}$ and $\texttt{CAATA}$ can be transformed into a closed necklace with base cycle $\texttt{ATCAA}$ and pendants $\texttt{CAC}$, $\texttt{ATA}$.
  • ...and 1 more figure
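
The two graph models described in the caption of Figure 1 can be sketched as follows. This is a minimal illustration using plain Python sets; the function names and the toy input are assumptions made for the example and do not come from the paper. The node-centric construction compares all pairs of $k$-mers, so it is only meant for small inputs.

```python
def kmers(strings, k):
    """Set of distinct k-mers of the input strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def node_centric_dbg(strings, k):
    """Nodes are k-mers; edge x -> y when the last k-1 characters of x
    equal the first k-1 characters of y."""
    nodes = kmers(strings, k)
    edges = {(x, y) for x in nodes for y in nodes if x[1:] == y[:-1]}
    return nodes, edges


def edge_centric_dbg(strings, k):
    """Nodes are (k-1)-mers; each k-mer uc yields one edge (u, v, c),
    where v is u without its first character, followed by c.
    Assumes every input string has length at least k."""
    edges = {(x[:-1], x[1:], x[-1]) for x in kmers(strings, k)}
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    return nodes, edges


# Hypothetical toy input (not the input of Figure 1).
toy = ["ACGTAC", "GTACG"]
nc_nodes, nc_edges = node_centric_dbg(toy, 3)
ec_nodes, ec_edges = edge_centric_dbg(toy, 3)

if __name__ == "__main__":
    # Both counts equal the number of distinct 3-mers of the toy input.
    print(len(nc_nodes), len(ec_edges))
```

By construction, each distinct $k$-mer becomes one node in the node-centric graph and one edge in the edge-centric graph, which is why the two counts coincide.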

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Theorem 3
  • Lemma 4: Existence of a new cycle
  • Theorem 5
  • Theorem 6