Clustering for Protein Representation Learning

Ruijie Quan, Wenguan Wang, Fan Ma, Hehe Fan, Yi Yang

TL;DR

A neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information and achieves state-of-the-art performance.

Abstract

Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article, we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering, until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene ontology term prediction, and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance.
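The iterative procedure described in the abstract (cluster, score, keep the medoids of the best clusters, repeat) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the radius `r`, the keep-fraction `omega`, the mean-pooled cluster feature, and the norm-based score are all placeholders chosen for the example, standing in for the learned neural components of the paper. Only 3D distances are used to form clusters here; the 1D sequence position is ignored for brevity.

```python
import numpy as np

def iterative_cluster_selection(coords, feats, r=4.0, omega=0.5, iterations=4):
    """Toy version of: cluster -> score -> keep top medoids, repeated.

    coords: (N, 3) array of 3D positions, one row per amino acid.
    feats:  (N, D) array of per-residue features.
    Returns indices (into the original protein) of the surviving residues.
    """
    feats = np.array(feats, dtype=float)   # local copy; the input is not modified
    idx = np.arange(len(coords))           # residues still in play
    for _ in range(iterations):
        pts, x = coords[idx], feats[idx]
        # Assumption: every remaining residue is the medoid of one cluster
        # containing all remaining residues within distance r of it.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        member = (dists <= r).astype(float)            # (n, n) membership mask
        # Assumption: cluster feature = mean of member features
        # (a stand-in for learned feature-extraction blocks).
        cluster_feat = (member @ x) / member.sum(axis=1, keepdims=True)
        # Assumption: score = L2 norm of the cluster feature
        # (a stand-in for a learned scoring function).
        scores = np.linalg.norm(cluster_feat, axis=1)
        n_keep = max(1, int(np.floor(omega * len(idx))))
        top = np.argsort(-scores, kind="stable")[:n_keep]
        keep = np.sort(top)                            # restore sequence order
        idx = idx[keep]
        feats[idx] = cluster_feat[keep]                # medoids carry cluster features forward
    return idx
```

Calling `iterative_cluster_selection(coords, feats, iterations=4)` on a 16-residue toy protein with `omega=0.5` leaves a single residue index, since the count halves at every iteration (16, 8, 4, 2, 1).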

Paper Structure (23 sections, 9 equations, 7 figures, 8 tables, 1 algorithm)


Figures (7)

  • Figure 1: Overview of our iterative neural clustering pipeline for protein representation learning: (a) the input protein with its amino acids; (b) the iterative clustering algorithm, which repeatedly stacks three steps $\circlearrowright(\bigtriangleup\,\square\,\bigtriangledown)$; (c) the output, which can be seen as the critical amino acids of the protein; (d) the output amino acids used for classification. Details of the iterative neural clustering method are given in § \ref{subsec:hier_clus}.
  • Figure 2: Our neural clustering framework architecture with four iterations. Given a protein, represented as a set of amino acids with 1D (sequence) and 3D (spatial) positions, our method adopts an iterative clustering algorithm to find the most representative amino acids. At each iteration, $B$ cluster representation extraction blocks are used to extract cluster features. The clustering nomination operation then selects a fraction $\omega$ of the amino acids for the next iteration, so that $N_t = \lfloor \omega \cdot N_{t-1} \rfloor$. Details of the framework are given in § \ref{subsec:impl}.
  • Figure 3: Performance change curve with different combinations of $\omega$ and $r$ for enzyme reaction classification. See § \ref{subsec:analysis} for details.
  • Figure 4: Visualization results of the protein structure at each iteration. The color of each node denotes the score calculated in the CN step. See related analysis in § \ref{sec:visualization}.
  • Figure 5: Clustering results for a protein exhibit variations across EC and GO-MF predictions. See related analysis in § \ref{sec:visualization}.
  • ...and 2 more figures
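
The selection rule in the Figure 2 caption, $N_t = \lfloor \omega \cdot N_{t-1} \rfloor$, implies a geometric shrinkage of the number of retained amino acids. The short sketch below simply iterates that formula; the function name and the example values (100 residues, $\omega = 0.5$) are illustrative assumptions, not settings reported in the paper.

```python
import math

def residues_per_iteration(n0, omega, iterations):
    """Apply N_t = floor(omega * N_{t-1}) repeatedly, starting from n0."""
    counts = [n0]
    for _ in range(iterations):
        counts.append(math.floor(omega * counts[-1]))
    return counts

# Hypothetical example: a 100-residue protein, omega = 0.5, four iterations.
print(residues_per_iteration(100, 0.5, 4))  # [100, 50, 25, 12, 6]
```

Because of the floor, odd counts lose the extra residue (25 becomes 12, not 12.5), so the final count can be slightly smaller than $\omega^t \cdot N_0$.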