Sequence Analysis Using the Bezier Curve

Taslim Murad, Sarwan Ali, Murray Patterson

TL;DR

The paper addresses the challenge of effectively analyzing biological sequences by transforming them into informative images for DL classifiers, overcoming sparse, high-dimensional vector representations. It introduces a Bézier-curve–based encoding that maps sequence elements onto a smooth curve, yielding denser, more informative images than conventional CGR methods. Across multiple protein, nucleotide, and chemical sequence datasets, Bézier-based images combined with DL classifiers (including CNNs and vision transformers) outperform traditional baselines and support improved visualization of class structure via t-SNE and confusion matrices. This approach offers a scalable, generalizable pathway for sequence classification and insight into biological patterns, with potential applications to larger datasets and nucleotide-level analyses.

Abstract

The analysis of sequences (e.g., protein, DNA, and SMILES strings) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery. Conventional analytical methods transform sequences into numerical representations so that machine learning or deep learning models can be applied for sequence characterization. However, their efficacy is constrained by the nature of deep learning (DL) models, which tend to perform suboptimally on tabular data. An alternative group of methods converts biological sequences into images by applying Chaos Game Representation (CGR). A notable drawback of these methods is that they map the individual elements of a sequence onto a relatively small subset of designated pixels in the generated image. The resulting sparse image may not adequately capture the sequence information, potentially leading to suboptimal predictions. In this study, we introduce a novel approach that transforms sequences into images using the Bézier curve concept for element mapping. Mapping the elements onto a curve enhances the representation of sequence information in the respective images, yielding better DL-based classification performance. We validated our system on different sequence datasets and classification tasks, and the results show that our Bézier curve method achieves good performance on all of the tasks.
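To make the idea of mapping sequence elements onto a curve concrete, the following is a minimal sketch of evaluating a general Bézier curve, $B(t) = \sum_{i=0}^{n} \binom{n}{i} (1-t)^{n-i} t^{i} P_i$, at $m$ evenly spaced parameter values. The choice of control points (element index as x, character code as y) and the function names `bezier_point` and `sequence_to_curve` are assumptions made for illustration only; the abstract does not specify how the paper derives its control points, and the paper's own construction may differ.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis weight of the i-th control point.
        w = comb(n, i) * (1 - t) ** (n - i) * t ** i
        x += w * px
        y += w * py
    return x, y


def sequence_to_curve(sequence, m=5):
    """Return m curve points for a sequence (requires m >= 2).

    ASSUMPTION: control point i is (index, character code). This is a
    hypothetical stand-in, not necessarily the paper's mapping.
    """
    control_points = [(float(i), float(ord(ch))) for i, ch in enumerate(sequence)]
    ts = [k / (m - 1) for k in range(m)]
    return [bezier_point(control_points, t) for t in ts]


curve = sequence_to_curve("MAVM", m=5)
for x, y in curve:
    print(f"{x:.3f}, {y:.3f}")
```

The curve starts at the first control point and ends at the last one, while the intermediate elements pull the curve toward themselves without necessarily lying on it. Rasterizing many such points into a pixel grid would give an image that covers more pixels than a one-pixel-per-element mapping.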

Paper Structure

This paper contains 43 sections, 5 equations, 21 figures, 15 tables, 1 algorithm.

Figures (21)

  • Figure 1: The workflow of our system to create an image from a given sequence and a number of parameters $m$. We have used "MAVM" as an input sequence here. Note that the $cur\_Pts$ consists of a set of values for x coordinates and y coordinates.
  • Figure 2: Confusion matrices for the Protein Subcellular Localization dataset with the 2-layer CNN classifier, using the FCGR and Bézier image generation methods.
  • Figure 3: (a) shows the CGR-based determination of location for the "ATT" nucleotide sequence in the respective image. (b) illustrates the 20-flakes-based image created using the FCGR method for a sequence of amino acids. (c) shows the CGR representation for the secondary protein structure.
  • Figure 4: The Bézier curve method-based images created for two sequences from the ACP dataset. One sequence belongs to the active class of the dataset, while the other is from the inactive class.
  • Figure 5: The genome structure of SARS-CoV-2 virus.
  • ...and 16 more figures