Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering

Shujian Jiao, Bingxuan Li, Lei Wang, Xiaojin Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei

TL;DR

This work introduces ComproESM, a graph-enhanced protein sequence model that integrates Community Propagation-Based Clustering with masked language modeling to enrich global protein representations while preserving local amino acid context. By combining a graph-based clustering objective with MLM in a Transformer backbone, the approach leverages protein family and superfamily structure to guide pre-training, achieving state-of-the-art results on multiple downstream protein tasks. The authors validate their method on a 540k-protein dataset with 17k families and 3k superfamilies, showing improved performance in family classification, function prediction, and remote homology detection, while performing ablations to isolate the contributions of each component. The work highlights the value of jointly modeling hierarchical protein taxonomy and sequence context, offering a scalable framework for more functionally informative protein representations with potential implications for protein design and biology research.

Abstract

Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, uses a masked prediction task for unsupervised learning and produces amino acid representations with notable biochemical accuracy. Yet it falls short in capturing protein-level functional information, which leaves room to improve representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task refines local amino acid accuracy. Our model achieves state-of-the-art results in several downstream experiments, demonstrating that combining global and local objectives substantially improves protein representation quality.
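
The combination of a global family-classification objective with a local masked-prediction objective can be illustrated with a minimal sketch. The exact loss, its weighting, and the clustering step are not specified in this abstract, so everything below is an assumption made for illustration: the function names, the simple weighted sum, and the `family_weight` parameter are hypothetical, and the Community Propagation-Based Clustering Algorithm itself is not modeled.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_loss(token_logits, token_targets, masked_positions,
               family_logits, family_target, family_weight=1.0):
    """Hypothetical joint objective: masked-token loss plus weighted family loss.

    token_logits:     per-position lists of amino acid logits
    token_targets:    true amino acid index at each position
    masked_positions: positions that were masked and must be predicted
    family_logits:    one list of logits over protein families (sequence level)
    family_weight:    assumed trade-off between the two terms
    """
    # Local term: average cross-entropy over the masked positions only.
    mlm = sum(cross_entropy(token_logits[i], token_targets[i])
              for i in masked_positions)
    mlm /= max(len(masked_positions), 1)
    # Global term: one classification loss for the whole sequence.
    fam = cross_entropy(family_logits, family_target)
    return mlm + family_weight * fam, mlm, fam
```

In this sketch the masked-token term only sees individual residues in context, while the family term is computed once per sequence, which is one simple way to read the abstract's distinction between local and global representation quality.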

Paper Structure (25 sections, 15 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: The overview of our work.
  • Figure 2: Direction of information flow for the Community Propagation-Based Clustering Algorithm.
  • Figure 3: Model prediction layer representation of the biochemical properties of the embedded amino acids.
  • Figure 4: Protein representation after t-SNE reduction.
  • Figure 5: Community Propagation-Based Clustering Algorithm on random initialisation vectors.
  • ...and 4 more figures