Nearest Neighbor CCP-Based Molecular Sequence Analysis

Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson

TL;DR

Findings show that CCP-NN considerably improves the accuracy of the classification task and significantly outperforms CCP in terms of computational runtime.

Abstract

Molecular sequence analysis is crucial for understanding several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make such data challenging to analyze. Finding patterns and supporting subsequent research requires dimensionality reduction and feature selection approaches. Recently, Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. Although CCP is effective for sequence visualization, it is still costly to compute, and its utility for classifying molecular sequences is still uncertain. To address these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. CCP uses sequence-to-sequence correlations to group related molecular sequences and produce representative supersequences. Unlike conventional methods, CCP does not rely on matrix diagonalization, so it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. To assess the efficacy of the proposed approach, we performed molecular sequence classification using both CCP and CCP-NN representations. Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in computational runtime.
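
The abstract mentions estimating a density map with a nearest-neighbor search but gives no formula. As a rough illustration only, the sketch below uses the classic k-nearest-neighbor density estimate (k divided by n times the volume of the ball reaching the k-th neighbor). This is a generic textbook estimator, not the paper's CCP-NN procedure; the function name `knn_density`, the parameter `k`, and the brute-force Euclidean search are all assumptions made for the example.

```python
import math
import numpy as np

def knn_density(X, k=5):
    """Generic k-NN density estimate (illustrative; not the paper's method).

    density_i = k / (n * V_d * r_k(i)^d), where r_k(i) is the distance from
    point i to its k-th nearest neighbor and V_d is the unit-ball volume.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    # Brute-force pairwise Euclidean distances (O(n^2) memory; fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    r_k = np.sort(dist, axis=1)[:, k]
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    # Guard against zero radius when points coincide.
    return k / (n * unit_ball * np.maximum(r_k, 1e-12) ** d)

# Usage: a tight cluster should receive higher density than a diffuse one.
rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
diffuse = rng.normal(loc=100.0, scale=5.0, size=(20, 2))
density = knn_density(np.vstack([tight, diffuse]), k=5)
```

For large datasets the quadratic distance matrix would normally be replaced by a tree-based or approximate neighbor index; the brute-force version is kept here only to make the estimator itself easy to read.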

Paper Structure

This paper contains 31 sections, 15 equations, 4 figures, 18 tables, and 2 algorithms.

Figures (4)

  • Figure 1: t-SNE plots (Protein Subcellular Data) for different structure embeddings and Clustering and Projection methods (CCP and CCP-NN). The figure is best seen in color.
  • Figure 2: t-SNE plots (Coronavirus Host Data) for different structure embeddings and Clustering and Projection methods (CCP and CCP-NN). The figure is best seen in color.
  • Figure 3: t-SNE plots (Coronavirus Host Data) for different structure embeddings and Clustering and Projection methods (CCP and CCP-NN). The figure is best seen in color.
  • Figure 4: Runtime for embedding generation of Autoencoder with an increasing number of data points for different datasets. The figure is best seen in color.

Theorems & Definitions (2)

  • Remark 1
  • Remark 2