Topological Sequence Analysis of Genomes: Category theory Approaches

Jian Liu; Li Shen; Mushal Zia; Guo-Wei Wei

TL;DR

CTSA introduces a category-theoretic, multiscale approach to genomic and proteomic sequence analysis by modeling sequences as resolution categories and computing persistent homology of derived substructure complexes. It merges alignment-free scalability with structure-aware information via anchored subsequences and a hierarchy of k-mer spaces, offering robust topological embeddings. In protein–nucleic acid binding prediction, CTSA achieves Pearson correlation $r=0.709$ and RMSE $1.29$ kcal/mol, and in SARS-CoV-2 variant clustering it attains 100% accuracy, outperforming multiple baselines. The results suggest that topological features from category-theoretic representations capture informative sequence structure across tasks and could generalize to other domains and language-model integrations.

Abstract

Sequence data, such as DNA, RNA, and protein sequences, exhibit intricate, multi-scale structures that pose significant challenges for conventional analysis methods, particularly those relying on alignment or purely statistical representations. In this work, we introduce category-based topological sequence analysis (CTSA) of genomes. CTSA models a sequence as a resolution category, capturing its hierarchical structure through a categorical construction. Substructure complexes are then derived from this categorical representation, and their persistent homology is computed to extract multi-scale topological features. Our models depart from traditional alignment-free approaches by incorporating structured mathematical formalisms rooted in sequence topology. The resulting topological signatures provide informative representations across a variety of tasks, including the phylogenetic analysis of SARS-CoV-2 variants and the prediction of protein-nucleic acid binding affinities. Comparative studies were carried out against six state-of-the-art methods. Experimental results demonstrate that CTSA achieves excellent and consistent performance in these tasks, suggesting its general applicability and robustness. Beyond sequence analysis, the proposed framework opens new directions for the integration of categorical and homological theories for biological sequence analysis.
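
The pipeline described in the abstract (a hierarchy of k-mer scales followed by persistent homology) can be illustrated with a deliberately simplified sketch. This is not the resolution-category or substructure-complex construction of the paper, which is not specified here. It assumes a toy stand-in: for each k-mer, take its start positions as points on a line and read off the 0-dimensional Vietoris-Rips barcode, whose finite death values are the gaps between neighbouring positions. The function names `kmer_positions`, `h0_barcode`, and `multiscale_features` are hypothetical.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer in seq to the sorted list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return dict(positions)


def h0_barcode(points):
    """0-dimensional Vietoris-Rips barcode of points on a line.

    Every point is born at scale 0. Two neighbouring components merge when
    the scale reaches the gap between them, so the finite death values are
    exactly the consecutive gaps. One component never dies.
    """
    pts = sorted(points)
    if not pts:
        return []
    deaths = sorted(b - a for a, b in zip(pts, pts[1:]))
    return [(0, d) for d in deaths] + [(0, float("inf"))]


def multiscale_features(seq, ks=(1, 2, 3)):
    """Toy multiscale signature: one H0 barcode per (k, k-mer) pair.

    Assumption: this only mimics the 'hierarchy of k-mer spaces' idea;
    it is not the categorical construction used by CTSA.
    """
    features = {}
    for k in ks:
        for kmer, pos in kmer_positions(seq, k).items():
            features[(k, kmer)] = h0_barcode(pos)
    return features


if __name__ == "__main__":
    for key, bars in sorted(multiscale_features("ACGACGTA").items()):
        print(key, bars)
```

In practice the barcodes would be vectorized (for example by summary statistics of the death values) before being passed to a downstream regressor or clustering method; that step is omitted here.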
