Deep Manifold Transformation for Protein Representation Learning

Bozhen Hu, Zelin Zang, Cheng Tan, Stan Z. Li

TL;DR

This work tackles the challenge of learning universal and robust protein representations under limited data by introducing DMTPRL, a deep manifold transformation framework. It uses data augmentation and a novel manifold loss that preserves inter-node similarities across two latent spaces, enforcing topology alignment between latent graphs. Empirical results show state-of-the-art performance on PPI identification, protein fold and enzyme reaction classification, and amino-acid contact prediction, demonstrating improved generalization across tasks. The approach offers a principled inductive bias for protein graph embeddings and has potential for large-scale pretraining on extensive protein datasets.

Abstract

Protein representation learning is critical to many tasks in biology, such as drug design and protein structure or function prediction, and has primarily benefited from protein language models and graph neural networks. These models can capture intrinsic patterns from protein sequences and structures through masking and task-related losses. However, the learned protein representations are usually not well optimized, which leads to performance degradation under limited data and difficulty adapting to new tasks. To address this, we propose a new deep manifold transformation approach for universal protein representation learning (DMTPRL). It employs manifold learning strategies to improve the quality and adaptability of the learned embeddings. Specifically, we apply a novel manifold learning loss during training based on the graph inter-node similarity. Our proposed DMTPRL method outperforms state-of-the-art baselines on diverse downstream tasks across popular datasets. This validates our approach for learning universal and robust protein representations. We promise to release the code after acceptance.
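
To make the idea of a loss "based on the graph inter-node similarity" concrete, the following is a minimal sketch and not the paper's formulation, which this summary does not give. It assumes that node similarities in each latent space are obtained from pairwise Euclidean distances through a Student-t style kernel, and that the two similarity matrices are compared with an element-wise binary KL divergence. The function names (`pairwise_sq_dists`, `similarity`, `manifold_loss`), the kernel, and the parameter `nu` are all illustrative assumptions.

```python
import numpy as np


def pairwise_sq_dists(z):
    """Squared Euclidean distances between all rows of z (shape n x d)."""
    sq = np.sum(z * z, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


def similarity(z, nu=1.0):
    """Map distances to similarities in (0, 1] with a Student-t style kernel.

    The kernel and the degrees-of-freedom parameter nu are assumptions
    made for this sketch, not values taken from the paper.
    """
    return (1.0 + pairwise_sq_dists(z) / nu) ** (-(nu + 1.0) / 2.0)


def manifold_loss(z_a, z_b, eps=1e-8):
    """Mean binary KL divergence between the node-similarity matrices of
    two latent spaces (hypothetical stand-in for a manifold loss).

    z_a and z_b hold one embedding per node (same number of rows); the
    embedding dimensions may differ. The loss is zero when both spaces
    induce the same inter-node similarities.
    """
    p = np.clip(similarity(z_a), eps, 1.0 - eps)
    q = np.clip(similarity(z_b), eps, 1.0 - eps)
    kl = p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    off_diag = ~np.eye(len(z_a), dtype=bool)  # ignore self-similarities
    return float(kl[off_diag].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z1 = rng.normal(size=(6, 4))           # embeddings of 6 nodes, space 1
    z2 = z1 + 0.1 * rng.normal(size=(6, 4))  # slightly perturbed space 2
    print(manifold_loss(z1, z2))
```

Because the loss depends only on pairwise distances, it is unchanged by rotations of either latent space, which is one reason similarity-based objectives are a natural way to align the topology of two embeddings rather than their raw coordinates.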

Paper Structure

This paper contains 11 sections, 9 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: A pipeline illustrating the process of deep manifold transformation; $\theta$ denotes the network parameters.