A Graph-based Approach to Variant Extraction from Sequences
Mark A. Santcroos, Walter A. Kosters, Mihai Lefter, Jeroen F. J. Laros, Jonathan K. Vis
TL;DR
This paper frames the challenge of generating unambiguous, machine-readable variant descriptions from sequences and introduces a graph-based representation (the cLCS-graph) that encodes all minimal LCS alignments. It then presents three extraction methods (supremal, local supremal, and canonical HGVS-oriented) for deriving readable descriptions, including HGVS translations and repeat-aware constructs. An experimental comparison against the HGVS descriptions in dbSNP shows identical results for simple variants and more informative descriptions for complex repeats, which suggests practical benefits for variant representation and a possible influence on standardization. The work provides a complete variant representation framework with implications for database storage, lookup, and downstream clinical and research use, while acknowledging scalability and domain-specific limitations.
Abstract
Accurate variant descriptions are of paramount importance in the field of genomics. The domain is confronted with increasingly complex variants, e.g., combinations of multiple indels, making it challenging to generate proper variant descriptions directly from chromosomal sequences. We present a graph, based on all minimal alignments, that is a complete representation of a variant and gives more insight into the nature of a variant than a single variant description does. We provide three complementary extraction methods to derive variant descriptions from this graph, including one that yields domain-specific constructs from the HGVS nomenclature. Our experiments show that, compared with dbSNP, the authoritative variant database from the NCBI, our methods yield identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants, in particular for repeat expansions and contractions.
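To make the notion of "all minimal alignments" concrete, the following is a minimal sketch and not the authors' cLCS-graph construction. It assumes that minimal alignments correspond to longest common subsequence (LCS) alignments under an edit model with only insertions and deletions, and it lists every matching position pair that lies on at least one such alignment. The function names and the example sequences are hypothetical and chosen for illustration only.

```python
def lcs_table(a, b):
    """Return table t where t[i][j] is the LCS length of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def matches_on_minimal_alignments(ref, obs):
    """Return (LCS length, matching pairs that occur in some LCS alignment).

    Assumption: a 'minimal alignment' is an LCS alignment (indel-only model).
    A pair (i, j) with ref[i] == obs[j] qualifies when the best prefix
    alignment, the match itself, and the best suffix alignment together
    reach the overall LCS length.
    """
    m, n = len(ref), len(obs)
    fwd = lcs_table(ref, obs)
    # rev[p][q] is the LCS length of the last p symbols of ref
    # and the last q symbols of obs.
    rev = lcs_table(ref[::-1], obs[::-1])
    total = fwd[m][n]
    pairs = []
    for i in range(m):
        for j in range(n):
            if ref[i] == obs[j]:
                suffix = rev[m - i - 1][n - j - 1]
                if fwd[i][j] + 1 + suffix == total:
                    pairs.append((i, j))
    return total, pairs


def indel_distance(ref, obs):
    """Number of deleted plus inserted symbols in any LCS alignment."""
    total, _ = matches_on_minimal_alignments(ref, obs)
    return len(ref) + len(obs) - 2 * total


if __name__ == "__main__":
    # Deleting one T from a run of two Ts: either T may be the deleted one.
    length, pairs = matches_on_minimal_alignments("ATTCA", "ATCA")
    print(length, pairs, indel_distance("ATTCA", "ATCA"))
```

In this example both T positions of the reference (0-based indices 1 and 2) can be matched to the single T of the observed sequence, so two distinct minimal alignments exist for the same one-symbol deletion. This kind of ambiguity is what a single variant description hides and a representation of all minimal alignments keeps.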
