A Graph-based Approach to Variant Extraction from Sequences

Mark A. Santcroos, Walter A. Kosters, Mihai Lefter, Jeroen F. J. Laros, Jonathan K. Vis

TL;DR

The paper frames the challenge of generating unambiguous, machine-readable variant descriptions from sequences and introduces a graph-based representation (the cLCS-graph) that encodes all minimal LCS alignments. It then presents three extraction methods (supremal, local supremal, and canonical HGVS-oriented variants) to derive readable descriptions, including HGVS translations and repeat-aware constructs. An experimental comparison against the HGVS descriptions in dbSNP shows identical results for simple variants and more informative descriptions for complex repeats, which suggests practical benefits for variant representation and possible influence on standardization. The work provides a complete variant representation framework with implications for database storage, lookup, and downstream clinical and research use, while acknowledging scalability and domain-specific limitations.

Abstract

Accurate variant descriptions are of paramount importance in the field of genomics. The domain is confronted with increasingly complex variants, e.g., combinations of multiple indels, making it challenging to generate proper variant descriptions directly from chromosomal sequences. We present a graph based on all minimal alignments that is a complete representation of a variant and that gives more insight into the nature of a variant than a single variant description does. We provide three complementary extraction methods to derive variant descriptions from this graph, including one that yields domain-specific constructs from the HGVS nomenclature. Our experiments show that, compared with dbSNP, the authoritative variant database from the NCBI, our methods yield identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants, in particular for repeat expansions and contractions.

Paper Structure

This paper contains 21 sections, 5 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: A graph containing all LCS alignments for the strings ACCTGACT and ATCTTACTT. The paths through the graph correspond with six HGVS descriptions. The highlighted path with the description [2C>T;5G>T;8_9insT] corresponds to the HGVS recommendations.
  • Figure 2: The computed elements by the original ONP algorithm with their values according to Equation \ref{eq:lcs} for $R = \texttt{ACCTGACT}$ and $O = \texttt{ATCTTACTT}$. The dashed line marks the contour of the solution space, the final set $\mathit{FP}$. Matching symbols are circled.
  • Figure 3: The computed elements of our extension of the ONP algorithm with their values according to Equation \ref{eq:lcs} for $R = \texttt{ACCTGACT}$ and $O = \texttt{ATCTTACTT}$. Consecutive matches are grouped.
  • Figure 4: LCS-graph of $R = \texttt{ACCTGACT}$ and $O = \texttt{ATCTTACTT}$. Nodes represent matches and are labeled with their respective positions $(i, j)$. Edges describe the minimal replacements between the nodes. Dashed edges represent empty replacements. The leading edge indicates the source node and the double-circled node indicates the sink node.
  • Figure 5: Compressed LCS-graph of $R = \texttt{ACCTGACT}$ and $O = \texttt{ATCTTACTT}$. Nodes represent matches and are labeled as $(i, j, \ell)$. Edges describe the minimal replacements between the nodes. Multiple edges representing unique minimal replacements between two nodes can exist.
  • ...and 5 more figures
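
The running example in the figure captions can be checked by hand with a few lines of Python. The sketch below is not the paper's method: it uses the textbook quadratic dynamic program for the LCS length rather than the ONP algorithm or its extension, and it builds no graph. The helper names `lcs_length` and `apply_variants`, and the tuple encoding of variants, are assumptions made for illustration; they are not an HGVS parser or any existing library API. It shows only that the highlighted description from Figure 1, [2C>T;5G>T;8_9insT], turns the reference into the observed sequence, and that its cost agrees with an LCS of length 6.

```python
R = "ACCTGACT"   # reference sequence from the figure captions
O = "ATCTTACTT"  # observed sequence from the figure captions


def lcs_length(r: str, o: str) -> int:
    """Length of a longest common subsequence (textbook DP, one row at a time)."""
    prev = [0] * (len(o) + 1)
    for a in r:
        cur = [0]
        for j, b in enumerate(o, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def apply_variants(ref: str, variants) -> str:
    """Apply variants given in 1-based reference coordinates.

    Hypothetical encoding (an assumption, not parsed HGVS):
      ("sub", pos, old, new)  corresponds to  pos old>new
      ("ins", pos, seq)       corresponds to  pos_(pos+1)ins seq
    """
    out = list(ref)
    # Process right to left so that insertions do not shift pending positions.
    for v in sorted(variants, key=lambda v: v[1], reverse=True):
        if v[0] == "sub":
            _, pos, old, new = v
            if out[pos - 1] != old:
                raise ValueError(f"reference has {out[pos - 1]!r} at {pos}, not {old!r}")
            out[pos - 1] = new
        elif v[0] == "ins":
            _, pos, seq = v
            out[pos:pos] = list(seq)
        else:
            raise ValueError(f"unsupported variant type: {v[0]!r}")
    return "".join(out)


# [2C>T;5G>T;8_9insT] in the hypothetical tuple encoding
highlighted = [("sub", 2, "C", "T"), ("sub", 5, "G", "T"), ("ins", 8, "T")]

lcs = lcs_length(R, O)
indel_distance = len(R) + len(O) - 2 * lcs  # deletions plus insertions

print(lcs)                             # 6
print(indel_distance)                  # 5
print(apply_variants(R, highlighted))  # ATCTTACTT
```

Counting each substitution as one deletion plus one insertion, the highlighted description costs 2 + 2 + 1 = 5 operations, which equals the indel distance implied by the LCS length. That is consistent with the description corresponding to a minimal alignment, although the code does not enumerate the other five paths mentioned in the Figure 1 caption.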