Position Specific Scoring Is All You Need? Revisiting Protein Sequence Classification Tasks

Sarwan Ali, Taslim Murad, Prakash Chourasia, Haris Mansoor, Imdad Ullah Khan, Pin-Yu Chen, Murray Patterson

TL;DR

This work proposes a weighted PSS kernel matrix (W-PSSKM) that combines a PSS representation of protein sequences, which encodes the frequency information of each amino acid in a sequence, with the notion of the string kernel. The resulting kernel function outperforms many other approaches for protein sequence classification.

Abstract

Understanding the structural and functional characteristics of proteins is crucial for developing preventative and curative strategies that impact fields from drug discovery to policy development. An important and popular technique for examining how amino acids make up these characteristics of protein sequences is position-specific scoring (PSS). While the string kernel is crucial in natural language processing (NLP), it is unclear whether string kernels can extract biologically meaningful information from protein sequences, even though they have been shown to be effective in general sequence analysis tasks. In this work, we propose a weighted PSS kernel matrix (W-PSSKM) that combines a PSS representation of protein sequences, which encodes the frequency information of each amino acid in a sequence, with the notion of the string kernel. This results in a novel kernel function that outperforms many other approaches for protein sequence classification. We perform extensive experimentation to evaluate the proposed method. Our findings demonstrate that the W-PSSKM significantly outperforms existing baselines and state-of-the-art methods and achieves up to 45.1% improvement in classification accuracy.
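
To make the idea of pairing per-sequence amino acid frequency information with a kernel more concrete, here is a minimal sketch. It is not the W-PSSKM defined in the paper: the function names `frequency_vector` and `frequency_kernel_matrix`, the uniform default weights, and the toy sequences are all assumptions chosen for illustration. It only shows how frequency vectors can be turned into a weighted Gram matrix that a kernel classifier could consume.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def frequency_vector(sequence):
    """Return the relative frequency of each standard amino acid in `sequence`."""
    counts = np.zeros(len(AMINO_ACIDS))
    for residue in sequence:
        idx = AA_INDEX.get(residue)
        if idx is not None:  # skip ambiguous symbols such as 'X'
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def frequency_kernel_matrix(sequences, weights=None):
    """Gram matrix K[i, j] = sum_a w[a] * f_i[a] * f_j[a] (a weighted linear kernel).

    `weights` is a hypothetical per-amino-acid weighting; the paper's actual
    weighting scheme is not reproduced here.
    """
    features = np.array([frequency_vector(s) for s in sequences])
    if weights is None:
        weights = np.ones(len(AMINO_ACIDS))  # assumption: uniform weights
    return (features * weights) @ features.T


# Toy sequences (made up for illustration, not taken from any dataset).
sequences = ["MKVLAA", "MKVLAG", "WWWWYY"]
K = frequency_kernel_matrix(sequences)
```

With non-negative weights the resulting matrix is symmetric and positive semi-definite, so it could be passed to a kernel method that accepts a precomputed Gram matrix. In this toy example the first two sequences share most residues and get a larger kernel value than either has with the third.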

Paper Structure

This paper contains 39 sections, 11 equations, 7 figures, 11 tables, 2 algorithms.

Figures (7)

  • Figure 1: K-mers spectrum of two pairs of classes. (a) and (b) belong to the same class, while (c) and (d) belong to different classes for the Coronavirus Host dataset. The Gaussian kernel distance for (a) and (b) is almost 0, while for the W-PSSKM model it is 3.23 (larger distance is better). The Gaussian kernel distance for (c) and (d) is 0.48, while for the W-PSSKM model it is 0.39 (smaller distance is better).
  • Figure 2: K-mers spectrum of two pairs of classes. (a) and (b) belong to the same class, while (c) and (d) belong to different classes for the Spike7K dataset. The Gaussian kernel distance for (a) and (b) is almost 0, while for the W-PSSKM model it is 4.4 (larger distance is better). The Gaussian kernel distance for (c) and (d) is 0.57, while for the W-PSSKM model it is 0.49 (smaller distance is better).
  • Figure 3: K-mers spectrum of two pairs of classes. (a) and (b) belong to the same class, while (c) and (d) belong to different classes for the Protein Subcellular dataset. The Gaussian kernel distance for (a) and (b) is almost 0, while for the W-PSSKM model it is 4.4 (larger distance is better). The Gaussian kernel distance for (c) and (d) is 0.57, while for the W-PSSKM model it is 0.49 (smaller distance is better).
  • Figure 4: Heatmap for classes in Coronavirus Host.
  • Figure 5: Heatmap comparison for classes in Protein Subcellular dataset.
  • ...and 2 more figures