SBSM-Pro: Support Bio-sequence Machine for Proteins

Yizheng Wang, Yixiao Zhai, Yijie Ding, Quan Zou

TL;DR

The support bio-sequence machine for proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences, incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information.

Abstract

Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at http://lab.malab.cn/soft/SBSM-Pro/.
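The first two steps described in the abstract (grouping amino acids, then scoring pairs of sequences by alignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `GROUPS` dictionary is a hypothetical textbook-style grouping rather than one of the paper's dictionaries, Smith-Waterman local alignment is used as one example of an alignment score, and the `match`, `mismatch`, and `gap` values are arbitrary.

```python
# Hypothetical grouping by rough physicochemical class.
# This is NOT one of the dictionaries used in the paper.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "B": "FWYH",    # aromatic
    "C": "STNQ",    # polar, uncharged
    "D": "KR",      # positively charged
    "E": "DE",      # negatively charged
    "F": "GP",      # glycine and proline
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group label; unknown residues become 'X'."""
    return "".join(AA_TO_GROUP.get(aa, "X") for aa in seq.upper())


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (assumed scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs):
    """Pairwise alignment scores between group-reduced sequences."""
    reduced = [reduce_sequence(s) for s in seqs]
    return [[smith_waterman(x, y) for y in reduced] for x in reduced]
```

A matrix like this is the kind of input a support vector machine with a precomputed kernel can consume. Raw alignment scores are not guaranteed to form a positive semidefinite kernel, so a real pipeline would likely need a normalization or correction step that this sketch leaves out.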

Paper Structure (22 sections, 41 equations, 9 figures, 18 tables)

This paper contains 22 sections, 41 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Line graph comparison of ACC values between the proposed method and the existing methods. Distinct colored lines represent SBSM-Pro and existing methods.
  • Figure 2: Overview of the PSD process. a, Heatmaps of grid search parameter tuning in spectral clustering. The color mapping represents the magnitude of the CHI values, rendered as a continuous color image to show how CHI varies with $k_{c}$ and $\gamma$. Darker colors indicate higher CHI values, suggesting a better combination of parameters. b, Visual representations of the dictionaries for grouping. The upper half of the circle depicts the 20 common amino acids, while the lower half shows the specific groups of amino acids. Arrows link each amino acid in the upper half to its group below. To the right of the circle, a table lists the parameters $k_{c}$ and $\gamma$ used for spectral clustering of the given groups, along with their respective CHI values.
  • Figure 3: Bar chart comparing the effects of different dictionaries for grouping. The orange and blue bars represent the LS distance and SW score, respectively. We compared model performance on 10 datasets, distinguishing between models that use a dictionary and those that do not, with the latter labelled "not".
  • Figure 4: Concentric ring diagram illustrating the proportional kernel weights computed by HCKDM-MKL. The inner ring of the circle represents the proportions of two similarity measurement methods, the LS distance and SW score methods, each of which corresponds to the dictionaries depicted by different colors in the outer ring. The combination of ten dictionaries with two measurement methods results in a total of 20 similarity kernels. The weight proportions of these kernels within the fused kernel are visually represented in the outer ring.
  • Figure 5: Line graph comparing the effectiveness of different MKL methods. The two lines depict the performance of the top-performing and least-performing kernels. These lines divide the chart into three sections, colored blue, green, and yellow, corresponding to areas A, B, and C, respectively. In area A, the MKL approach demonstrates superior performance: MKL not only assigns weights to the different kernel matrices, highlighting the importance of well-performing kernels, but also improves performance beyond that of any single kernel. An MKL method that falls in area B accomplishes only kernel selection, while one that falls in area C is considered substandard.
  • ...and 4 more figures
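
The kernel fusion described in the Figure 4 caption (ten dictionaries combined with two similarity measures, giving 20 kernels that are merged by weight) can be sketched as a weighted sum of kernel matrices. The weighting rule below is kernel-target alignment, a generic heuristic swapped in for illustration. It is not HCKDM-MKL, whose details are not given in this summary, and the function names and kernel names are hypothetical.

```python
import math

# Hypothetical names for the 20 kernels: 2 measures x 10 dictionaries.
KERNEL_NAMES = [f"{m}-dict{d}" for m in ("LS", "SW") for d in range(1, 11)]


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def alignment_weights(kernels, labels):
    """Weight each kernel by its alignment with the label matrix y y^T.

    Labels are assumed to be +1 / -1. Negative alignments are clipped
    to zero, and the weights are normalized to sum to one.
    """
    target = [[yi * yj for yj in labels] for yi in labels]
    target_norm = math.sqrt(frobenius_inner(target, target))
    scores = []
    for k in kernels:
        k_norm = math.sqrt(frobenius_inner(k, k))
        score = frobenius_inner(k, target) / (k_norm * target_norm) if k_norm > 0 else 0.0
        scores.append(max(score, 0.0))
    total = sum(scores)
    if total == 0:
        # No kernel aligns with the labels: fall back to uniform weights.
        return [1.0 / len(kernels)] * len(kernels)
    return [s / total for s in scores]


def fuse_kernels(kernels, weights):
    """Return the weighted sum of the kernel matrices."""
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for k, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * k[i][j]
    return fused
```

In terms of Figure 5, a weighting that puts all of its mass on the single best kernel would land in area B (kernel selection only), while a fused kernel that outperforms every individual kernel would land in area A.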