Pairwise sequence alignment with block and character edit operations

Ahmet Cemal Alıcıoğlu, Can Alkan

TL;DR

SABER introduces a practical, traceback-enabled heuristic for the block edit distance that incorporates block deletions, moves, reversals, and single-character edits to align sequences without reliance on genomic markers. It builds a block-matching score matrix $W$ and a decision array $N$ to produce detailed alignments and breakpoints, with runtime near $O(m^2 n \ell_{\text{range}})$ and optimizations via semi-global alignment and shared DP tables. In experiments on simulated rearrangements and a real MHC locus fragment, SABER achieved around 81% accuracy overall and demonstrated competitive or superior breakpoint recovery compared to traditional aligners, while scaling to moderate sequence lengths. The approach operates directly on sequences (DNA alphabet) and provides open-source software for rearrangement-aware sequence comparison, offering a foundation for genome-scale implementations with block-edit capabilities.

Abstract

Pairwise sequence comparison is one of the most fundamental problems in string processing. The most common metric for quantifying the similarity between sequences $S$ and $T$ is the edit distance, $d(S,T)$, which is the minimum number of characters that must be substituted, deleted from, or inserted into $S$ to generate $T$. However, fewer edit operations may suffice to transform one string into the other if larger rearrangements are permitted. Block edit distance covers such changes at the substring level (i.e., blocks) and penalizes entire block removals, insertions, copies, and reversals with the same cost as single-character edits (Lopresti & Tomkins, 1997). Most studies of block edit distance to date aimed only to characterize the distance itself, for applications such as sequence nearest neighbor search, without reporting the full alignment details. Although a few tools, such as GR-Aligner, try to solve block edit distance for genomic sequences, they have limited functionality and are no longer maintained. Here, we present SABER, an algorithm to solve block edit distance that supports block deletions, block moves, and block reversals in addition to the classical single-character edit operations. Our algorithm runs in $O(m^2 \cdot n \cdot \ell_{\text{range}})$ time for $|S|=m$, $|T|=n$, and a permitted block size range of $\ell_{\text{range}}$, and it can report all breakpoints for the block operations. We also provide an implementation of SABER that is currently optimized for genomic sequences (i.e., generated by the DNA alphabet), although the algorithm can theoretically be used for any alphabet. SABER is available at http://github.com/BilkentCompGen/saber
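To make the motivation concrete, the following minimal sketch computes the classical character-level edit distance with the standard dynamic program and applies it to a pair of strings that differ by a single block move. It is not SABER's algorithm: the helper names `edit_distance` and `move_block` and the toy strings are assumptions introduced here for illustration only.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical character-level edit distance d(S, T) (unit costs)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # row for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (s[i - 1] != t[j - 1])
            delete = prev[j] + 1       # drop s[i-1]
            insert = curr[j - 1] + 1   # add t[j-1]
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[n]


def move_block(s: str, start: int, length: int, dest: int) -> str:
    """Hypothetical helper: cut s[start:start+length] and reinsert it
    at index dest of the remaining string (one block move)."""
    block = s[start:start + length]
    rest = s[:start] + s[start + length:]
    return rest[:dest] + block + rest[dest:]


S = "AAAACCCC"
T = move_block(S, 0, 4, 4)   # one block move turns S into "CCCCAAAA"
print(T, edit_distance(S, T))
```

Under a block edit model, this pair is one operation apart, while the character-level distance printed above is larger than 1 because no single substitution, insertion, or deletion can turn `S` into `T`.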

Paper Structure (14 sections, 5 equations, 5 figures, 2 algorithms)


Figures (5)

  • Figure 1: Block edit operations. A) block move, B) block reversal, C) block deletion, D) block copy, E) block move with a reversal (i.e., inverted block move). The current version of Saber supports operations depicted in A, B, C, and E.
  • Figure 2: Pairwise alignment of sequences under block edit distance model. Here, we show one block move (red, cost=1) with a single character insertion (C in purple, cost=1), one inverted move (green, total cost=2), and one single character deletion (T in blue, cost=1). The total block edit distance is, therefore, 5.
  • Figure 3: Example calculation of Block-Edit-Distance (BED) from a given $N$ array.
  • Figure 4: Accuracy of Saber at different levels of test divergence.
  • Figure S1: Alignment of a portion of the MHC locus. We selected a 6 Kb region from the human reference genome (GRCh38) that corresponds to the MHC locus, and the corresponding 1.8 Kb portion of an alternative haplotype (chr6_GL000251v2_alt). We then used Saber to predict block deletions. We also used Clustal-W (Thompson et al., 1994) to generate an optimal pairwise alignment of the two sequences. We show both Saber and Clustal-W predicted deletions as User Tracks in the UCSC Genome Browser. Deletions predicted by Saber overlap the UCSC alignment by 88.3%, whereas deletions predicted by Clustal-W overlap it by 77.2%.
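
The cost tally in the Figure 2 caption can be reproduced with a short sketch. It assumes unit costs for every operation and treats the inverted move as one block move plus one block reversal, which is one reading of the caption's "total cost=2". The operation names and the `block_edit_cost` helper are hypothetical and are not taken from the SABER software.

```python
# Assumed unit-cost model: each block or character operation costs 1.
COSTS = {
    "block_move": 1,
    "block_reversal": 1,
    "block_deletion": 1,
    "char_insert": 1,
    "char_delete": 1,
    "char_substitute": 1,
}


def block_edit_cost(ops: list[str]) -> int:
    """Sum the cost of a list of edit operations under COSTS."""
    return sum(COSTS[op] for op in ops)


# Operations listed in the Figure 2 caption; the inverted move is
# modelled here (an assumption) as a move followed by a reversal.
figure2_ops = [
    "block_move",                    # red block, cost 1
    "char_insert",                   # inserted C, cost 1
    "block_move", "block_reversal",  # green inverted move, cost 2
    "char_delete",                   # deleted T, cost 1
]

total = block_edit_cost(figure2_ops)
print(total)  # 5, matching the caption's total
```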