Hilbert Curve Based Molecular Sequence Analysis

Sarwan Ali; Tamkanat E Ali; Imdad Ullah Khan; Murray Patterson

TL;DR

This paper tackles the representation bottleneck in molecular sequence analysis by proposing a Hilbert-curve-based Chaos Game Representation (CGR) that converts one-dimensional sequences into images. It introduces an Alphabetic Index Mapping and a coordinate pipeline using a Hilbert curve of order $p$ in $N$ dimensions with image size $2^p \times 2^p$ and total points $\Theta = 2^{p \cdot N}$, including a distance mapping $D = \frac{I}{L} \cdot \Theta$ and Gray-code based coordinate computation. Empirical evaluation on anticancer peptide datasets shows state-of-the-art performance, e.g., accuracy $89.5\%$ on breast cancer ACPs and $94.5\%$ on lung ACPs with a simple CNN, outperforming vector- and image-based baselines. This work demonstrates the viability of image-based sequence analysis and points toward future directions in domain adaptation and hybrid architectures.
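
The coordinate pipeline summarized above can be illustrated with a minimal Python sketch for the two-dimensional case ($N = 2$). Several things here are assumptions rather than the authors' implementation: the function names are hypothetical, $I$ is treated as an integer index in $[0, L)$ with $L$ as its normalizing length (the summary does not define them further), the distance is floored to an integer, and the curve coordinates are computed with the classic iterative bit-manipulation conversion instead of the Gray-code formulation the paper mentions.

```python
from typing import Tuple


def total_points(p: int, n_dims: int = 2) -> int:
    """Theta = 2^(p * N): number of cells visited by an order-p curve."""
    return 2 ** (p * n_dims)


def distance_from_index(index: int, length: int, p: int, n_dims: int = 2) -> int:
    """D = (I / L) * Theta, floored to an integer distance along the curve.

    Assumption: `index` plays the role of I and `length` the role of L.
    The result is clamped so it always stays inside [0, Theta - 1].
    """
    theta = total_points(p, n_dims)
    d = (index * theta) // length
    return min(d, theta - 1)


def hilbert_d2xy(p: int, d: int) -> Tuple[int, int]:
    """Map a distance d along a 2D Hilbert curve of order p to (x, y).

    Classic iterative conversion (not the Gray-code variant): at each
    scale s, two bits of d select a quadrant, and the partial
    coordinates are rotated or reflected to keep the curve continuous.
    """
    x = y = 0
    t = d
    s = 1
    side = 1 << p  # image side length 2^p
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect within the current s-by-s block.
                x = s - 1 - x
                y = s - 1 - y
            # Transpose the block.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


if __name__ == "__main__":
    p = 3  # 8 x 8 image, Theta = 64
    for index in range(0, 20, 5):
        d = distance_from_index(index, 20, p)
        print(index, d, hilbert_d2xy(p, d))
```

Because consecutive distances always land on neighbouring pixels, indices that are close in $[0, L)$ stay close in the $2^p \times 2^p$ image, which is the locality property that motivates using a Hilbert curve here.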

Abstract

Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment, which limits their accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed with image-based data. To address this problem and let DL models reach their full potential while capturing the important spatial information in the sequence data, we propose a universal Hilbert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic Index Mapping technique used in constructing Hilbert curve-based image representations from molecular sequences. Our method can be applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results, outperforming current state-of-the-art methods with an accuracy of $94.5\%$ and an F1 score of $93.9\%$ when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.
