LCPan: efficient variation graph construction using Locally Consistent Parsing

Akmuhammet Ashyralyyev, Zülal Bingöl, Begüm Filiz Öz, Kaiyuan Zhu, Salem Malikic, Uzi Vishkin, S. Cenk Sahinalp, Can Alkan

TL;DR

The paper addresses the need for memory-efficient, scalable genomic string processing beyond traditional k-mer sketches by introducing Locally Consistent Parsing (LCP), which partitions genomes into consistently labeled cores. It presents an iterative, practical implementation (Lcptools) and a variation-graph construction method (LCPan) that leverage LCP cores to enable faster and more memory-efficient analyses. Empirical results show that core counts follow a geometric decay with level i, approximately $O(n/c^i)$ where $c \approx 2.34$, while core lengths and inter-core distances scale as $O(c^i)$, enabling dramatic reductions in representation size. LCPan delivers substantial performance gains over vg (>$10\times$ faster construction and >$13\times$ lower memory) and comparable or better alignment accuracy, demonstrating the practical impact of LCP-based graph construction for large-scale genomics, with open-source implementations available for broader use.

Abstract

Efficient and consistent string processing is critical in an era of exponentially growing genomic data. Locally Consistent Parsing (LCP) addresses this need by partitioning an input genome string into short, exactly matching substrings ("cores"), ensuring consistency across partitions. Labeling the cores of an input string consistently not only provides a compact representation of the input but also enables the reapplication of LCP to refine the cores over multiple iterations, providing a progressively longer and more informative set of substrings for downstream analyses. We present the first iterative implementation of LCP with Lcptools and demonstrate its effectiveness in identifying cores with minimal collisions. Experimental results show that the number of cores at the $i$-th iteration is $O(n/c^i)$ for $c \approx 2.34$, while the average length and the average distance between consecutive cores are $O(c^i)$. Compared to popular sketching techniques, LCP produces significantly fewer cores, enabling a more compact representation and faster analyses. To demonstrate the advantages of LCP in genomic string processing in terms of computation and memory efficiency, we also introduce LCPan, an efficient variation graph constructor. We show that LCPan generates variation graphs >$10\times$ faster than vg, while using >$13\times$ less memory.
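
To give a feel for the reported $O(n/c^i)$ decay, the following minimal Python sketch evaluates $n/c^i$ for a few levels. It is only an illustration: the function name `expected_cores`, the example input length, and the choice of a hidden constant of 1 are assumptions made here, not values or code from the paper or from Lcptools.

```python
def expected_cores(n: int, level: int, c: float = 2.34) -> float:
    """Rough estimate n / c**level of the core count at a given LCP level.

    Assumption: the constant hidden by the O(.) notation is taken to be 1,
    so the result shows the trend only, not a measured count.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return n / c ** level


if __name__ == "__main__":
    n = 1_000_000  # hypothetical input length, chosen only for illustration
    for level in range(1, 9):
        print(f"level {level}: ~{expected_cores(n, level):,.0f} cores")
```

Under the same assumption, the average core length and inter-core distance would grow roughly as $c^i$, which is why each extra level shrinks the representation by about a factor of $c$.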

Paper Structure

This paper contains 12 sections, 4 theorems, 5 figures, 5 tables.

Key Result

Lemma

Contiguity Property: There are no gaps between any pair of consecutive cores identified by LCP.
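
The property can be phrased as a simple check over core coordinates. The sketch below assumes, purely for illustration, that each core is a half-open interval `(start, end)` on the input string and that cores are listed in left-to-right order; this representation and the name `is_contiguous` are hypothetical and not taken from Lcptools.

```python
from typing import List, Tuple


def is_contiguous(cores: List[Tuple[int, int]]) -> bool:
    """Return True if no pair of consecutive cores leaves a gap.

    Assumptions: cores are half-open intervals (start, end), given in
    left-to-right order. Adjacent or overlapping cores count as gap-free.
    """
    return all(nxt[0] <= prev[1] for prev, nxt in zip(cores, cores[1:]))
```

For example, `is_contiguous([(0, 3), (2, 5), (5, 8)])` holds because each core starts at or before the end of the previous one, whereas `[(0, 3), (4, 6)]` leaves position 3 uncovered.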

Figures (5)

  • Figure 1: Multi-thread scaling analysis of LCPan and vg using HPRC data on GRCh38. Memory and run time of graph construction using different numbers of threads. LCPan and vg were run using their native multithreading implementations. vg was additionally parallelized using GNU Parallel by distributing single-threaded processes on chunks across concurrent jobs (VG-GNU).
  • Figure 2: Processing a string using LCP. Here, a blue underline marks a Local Minimum core, green a Local Maximum core, red a Repetitive Interior core, and yellow a Stranded Sequence core.
  • Figure 3: DCT for reducing bitstreams to a new alphabet. Each block above corresponds to the bitstream of a core. The DCT compares each core's bitstream to that of its left neighbor to form a shorter alphabet. E.g., the least significant bit at which the core bitstream $11101011$ differs from its left neighbor $011011$ is at index $4$ (counting from the right and starting at index 0). The value of this bit in the core is $0$. Therefore, DCT replaces the core bitstream $11101011$ with the concatenation of the binary representation of $4$ (i.e., $100$) and the value of the differing bit, resulting in $1000$. The figure shows the reduced alphabet inferred by the DCT for each core bitstream, on which LCP can be applied.
  • Figure 4: Overview of the LCP with DCT and Labeling Paradigm.
  • Figure 5: Overview of the LCPan Method.
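
The DCT step described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration of the caption's example, not the Lcptools implementation: the function name `dct_reduce`, the use of bit strings, and the zero-padding of the shorter bitstream on the left are assumptions made here.

```python
def dct_reduce(core: str, left: str) -> str:
    """Reduce a core bitstream against its left neighbor, as in the caption.

    Finds the least significant index (counting from the right, starting
    at 0) where the two bitstreams differ, and returns the binary form of
    that index followed by the core's bit at that index.

    Assumption: the shorter bitstream is zero-padded on the left.
    """
    width = max(len(core), len(left))
    a, b = core.zfill(width), left.zfill(width)
    for idx in range(width):
        core_bit = a[width - 1 - idx]
        left_bit = b[width - 1 - idx]
        if core_bit != left_bit:
            return format(idx, "b") + core_bit
    raise ValueError("bitstreams are identical; no differing bit to encode")
```

With the caption's values, `dct_reduce("11101011", "011011")` finds the first difference at index 4 and yields `"1000"`, matching the figure.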

Theorems & Definitions (8)

  • Definition
  • Definition
  • Lemma
  • Lemma
  • Lemma
  • Proof
  • Lemma
  • Proof