Implementation Of Dynamic De Bruijn Graphs Via Learned Index

Riccardo Nigrelli

TL;DR

This work tackles the memory and time bottlenecks of dynamic De Bruijn graph construction for large-scale sequencing by introducing a learned-index approach that indexes $k$-mers with the Dynamic PGM-Index Set. The method supports creation, insertion, deletion, and search, and is optimized with online index construction and single-element vector representations to reduce memory while maintaining competitive performance. Compared to DynamicBOSS and other baselines, the learned-index approach demonstrates superior update and query efficiency on large datasets (e.g., $>10^8$ $k$-mers), though memory use during creation can be higher. The results suggest a practical, scalable path for dynamic graph representations in genomics, with future work aimed at supporting $k$-mer sizes up to $k = 255$ and exploring colored De Bruijn graphs.
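To make the four operations concrete, the following is a minimal sketch of indexing $k$-mers as integer keys in a dynamic ordered set. It is not the paper's implementation: the 2-bit encoding table `ENCODE`, the helper names, and the `SortedKmerSet` class are assumptions made for illustration, and a plain sorted list with binary search stands in for the Dynamic PGM-Index Set. It also assumes a single fixed $k$, since keys of different lengths could collide under this encoding.

```python
from bisect import bisect_left

# Assumed 2-bit alphabet encoding; the paper's actual encoding may differ.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer: str) -> int:
    """Pack a DNA k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def kmers(sequence: str, k: int):
    """Yield every length-k substring of the sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class SortedKmerSet:
    """Hypothetical stand-in for a dynamic ordered index over integer keys."""

    def __init__(self):
        self._keys = []  # kept sorted and free of duplicates

    def insert(self, kmer: str) -> None:
        key = kmer_to_int(kmer)
        i = bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def delete(self, kmer: str) -> None:
        key = kmer_to_int(kmer)
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def contains(self, kmer: str) -> bool:
        key = kmer_to_int(kmer)
        i = bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def __len__(self) -> int:
        return len(self._keys)


# Creation: insert every 3-mer of a short read.
index = SortedKmerSet()
for kmer in kmers("ACGTA", 3):
    index.insert(kmer)
```

In this sketch a list insertion costs linear time, which is exactly the kind of update cost a dynamic learned index is meant to avoid on large inputs.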

Abstract

De Bruijn graphs are essential for sequencing data analysis and must be efficiently constructed and stored for large-scale population studies. They also need to be dynamic to allow updates such as adding or removing edges and nodes. Existing dynamic implementations include DynamicBOSS and dynamicDBG. In 2018, a new family of data structures called learned indexes was introduced by Tim Kraska and Alex Beutel, with a particularly efficient implementation proposed by Paolo Ferragina and Giorgio Vinciguerra in 2020. This paper presents a new method for implementing De Bruijn graphs using learned indexes and compares its performance with current implementations. The new method shows improved time and memory efficiency for edge and node insertions, particularly with large datasets (over 110 million $k$-mers).
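The general idea behind a learned index can be sketched as follows: a model predicts where a key sits in a sorted array, and a search bounded by the model's maximum error corrects the prediction. The example below is a simplified illustration with a single linear model, not the PGM-index, which uses piecewise linear segments with a fixed error bound. The class name `LinearLearnedIndex` and its methods are hypothetical, and the sketch assumes a non-empty, sorted list of integer keys.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Static learned index: one linear model plus an error-bounded search."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)  # assumed sorted and non-empty
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Line through the first and last (key, position) pairs.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Largest gap between predicted and true position over all keys.
        self.epsilon = max(
            abs(self._predict(key) - i) for i, key in enumerate(self.keys)
        )

    def _predict(self, key: int) -> int:
        return int(self.slope * key + self.intercept)

    def find(self, key: int) -> int:
        """Return the position of key, or -1 if it is absent."""
        pos = self._predict(key)
        lo = max(0, pos - self.epsilon)
        hi = min(len(self.keys), pos + self.epsilon + 1)
        # Binary search only inside the window the model guarantees.
        i = bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return -1


keys = [2, 3, 5, 8, 13, 21, 34]
learned = LinearLearnedIndex(keys)
```

Because `epsilon` is measured with the same prediction function used at query time, every stored key is guaranteed to fall inside the searched window. How tight that window is depends on how well a line fits the key distribution.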

Paper Structure

This paper contains 4 sections:

Introduction
Methodology
Evaluation
Conclusions and Future Works
