Challenges in Solving Sequence-to-Graph Alignment with Co-Linear Structure

Xingfu Li

TL;DR

The paper investigates two co-linear chaining formulations for sequence-to-graph alignment: Gap-CLC and Edit-CLC. It formalizes anchors via Cartesian-product occurrences in query sequences and pan-genome graphs, and defines the corresponding optimization objectives, including single-variant forms. Through linear-time reductions, it shows that Gap-CLC inherits the hardness of exact seed matching under the Strong Exponential Time Hypothesis (SETH), so a sub-quadratic algorithm is ruled out unless SETH fails, while Edit-CLC is NP-hard when the graph contains errors. These results suggest that incorporating co-linear structure does not alleviate computational complexity, underscoring a need for practical algorithms and precise anchor representations to bridge theory and practice.
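To make the "Cartesian-product occurrences" idea concrete, the following is a minimal sketch, not the paper's construction. It assumes a toy graph given as a mapping from node names to label strings and, as a loud simplification, only counts seed occurrences that lie inside a single node label (occurrences spanning several nodes are ignored). The function names `occurrences_in_query`, `occurrences_in_graph`, and `candidate_anchors` are hypothetical and introduced only for illustration.

```python
from itertools import product


def occurrences_in_query(seed, query):
    """Return every start index at which `seed` occurs in `query` (overlaps included)."""
    hits = []
    start = query.find(seed)
    while start != -1:
        hits.append(start)
        start = query.find(seed, start + 1)
    return hits


def occurrences_in_graph(seed, node_labels):
    """Return (node, offset) pairs where `seed` occurs inside one node label.

    Simplifying assumption (not from the paper): occurrences that span
    several nodes along a path are ignored.
    """
    hits = []
    for node, label in node_labels.items():
        for offset in occurrences_in_query(seed, label):
            hits.append((node, offset))
    return hits


def candidate_anchors(seeds, query, node_labels):
    """Pair every query occurrence of a seed with every graph occurrence of it."""
    anchors = []
    for seed in seeds:
        x_q = occurrences_in_query(seed, query)        # positions in the query
        y_q = occurrences_in_graph(seed, node_labels)  # positions in the graph
        # The full Cartesian product; an actual anchor set may be any subset of it.
        anchors.extend((seed, x, y) for x, y in product(x_q, y_q))
    return anchors


# Hypothetical toy input: one seed, a short query, and three labeled nodes.
query = "ACGTACG"
node_labels = {"v1": "ACG", "v2": "TT", "v3": "TACG"}
anchors = candidate_anchors(["ACG"], query, node_labels)
```

Here the seed occurs twice in the query and twice in the graph, so the product yields four candidate anchors. Since each seed can contribute a number of candidates equal to the product of its occurrence counts, a compact representation of the two occurrence sets can be much smaller than the explicit anchor list.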

Abstract

Sequence alignment is a cornerstone technique in computational biology for assessing similarities and differences among biological sequences. A key variant, sequence-to-graph alignment, plays a crucial role in effectively capturing genetic variations. In this work, we introduce two novel formulations within this framework: the Gap-Sensitive Co-Linear Chaining (Gap-CLC) problem and the Co-Linear Chaining with Errors based on Edit Distance (Edit-CLC) problem, and we investigate their computational complexity. We show that solving the Gap-CLC problem in sub-quadratic time is highly unlikely unless the Strong Exponential Time Hypothesis (SETH) fails -- even when restricted to binary alphabets. Furthermore, we establish that the Edit-CLC problem is NP-hard in the presence of errors within the graph. These findings emphasize that incorporating co-linear structures into sequence-to-graph alignment models fails to reduce computational complexity, highlighting that these models remain at least as computationally challenging to solve as those lacking such prior information.

Paper Structure

This paper contains 5 sections, 6 theorems, and 2 tables.

Key Result

Lemma 1

Given an alphabet $\Sigma$, a query sequence $Q\in \Sigma^{*}\setminus\{\epsilon\}$, a pan-genome graph $G = (V, E, \delta)$ and a set of anchors $A$ on $(Q,G)$, there are two additional sets $X_q\subseteq loc(q,Q)$ and $Y_q\subseteq loc(q,G)$ for each sequence $q\in R$, such that $A \subseteq \bigc

Theorems & Definitions (12)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Theorem 1
  • proof
  • ...and 2 more