ProtFAD: Introducing function-aware domains as implicit modality towards protein function prediction

Mingqing Wang, Zhiwei Nie, Yonghong He, Athanasios V. Vasilakos, Zhixiang Ren

TL;DR

ProtFAD introduces function-aware domain embeddings as an implicit modality to bridge protein sequence/structure and function. By pre-training domain representations with GO priors and text semantics, and by employing a domain-attention fusion with a domain-joint triplet InfoNCE loss, the approach achieves consistent improvements over state-of-the-art methods across multiple protein-function benchmarks. The method also demonstrates that domain-level information can be highly discriminative for function prediction and offers robust multi-modal alignment even when some data modalities are scarce or noisy. Overall, ProtFAD enhances robustness and interpretability in protein function prediction by leveraging function priors encoded in protein domains.

Abstract

Protein function prediction is currently achieved by encoding a protein's sequence or structure, where sequence-to-function transcendence and the scarcity of high-quality structural data lead to obvious performance bottlenecks. Protein domains are functionally independent "building blocks" of proteins, and their combinations determine diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information contained in protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, together with a domain-joint contrastive learning strategy that distinguishes different protein functions while aligning the modalities. Specifically, we align domain semantics with GO terms and text descriptions to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms state-of-the-art methods on various benchmarks, and differentiates proteins carrying distinct functions more clearly than the competing method. Our implementation is available at https://github.com/AI-HPC-Research-Team/ProtFAD.

Paper Structure (33 sections, 20 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: The motivation and methodology of our method. The performance bottlenecks of existing protein function prediction methods are rooted in sequence-to-function transcendence and high-quality structural data scarcity. On the path from protein sequence to function, unlike explicit modalities such as sequence or structure, the protein domain is a function-oriented implicit modality carrying domain-function priors. By introducing this implicit transition modality, the sequence or structure modality and protein function can be bridged in a tangible way.
  • Figure 2: (a) Function-aware domain embeddings pre-training. The domain vocabularies are pre-trained with function priors, including the domain-GO probabilities and the text semantic consistency. (b) Multi-modal function prediction architecture. We show how to augment sequence-based and structure-based models with function-aware domain and domain-joint contrastive learning.
  • Figure 3: Protein domain-joint contrastive learning. Here, we demonstrate domain-joint contrastive learning between structure and joint domains, which distinguishes different protein functions while aligning the modalities. Contrastive learning between sequence and joint domains is similar.
  • Figure 4: Dimensionality-reduction visualizations of domain embeddings extracted by ESM-2 and by our function-aware domain (FAD) embeddings. Two common functions, "cytoplasm" (a) and "metal-ion-binding" (b), are selected.
  • Figure 5: Establishing associations between domains and GO terms through proteins. Two distinct meta-paths are defined and used to summarize the connection between domains and GO terms: domain-protein-function (red line) and domain-protein (yellow line).
  • ...and 2 more figures
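
To make the contrastive alignment described in the Figure 3 caption more concrete, the following is a minimal sketch of the standard, one-directional InfoNCE loss between structure embeddings and joint-domain embeddings. It is not the paper's triplet InfoNCE loss, whose exact form is not given in this excerpt. The function names (`cosine`, `info_nce`), the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Standard one-directional InfoNCE averaged over a batch.

    anchors[i] (e.g. a structure embedding) is paired with positives[i]
    (e.g. the embedding of a joint-domain sub-view of the same protein).
    Every positives[j] with j != i acts as a negative for anchors[i].
    The temperature default is an arbitrary illustrative choice.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy, made-up embeddings for illustration only (not real protein data).
structure_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_domains = [[0.9, 0.1], [0.1, 0.9]]   # each row matches its anchor
swapped_domains = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(structure_emb, aligned_domains)
loss_swapped = info_nce(structure_emb, swapped_domains)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is lower when each structure embedding is closest to its own domain embedding, which is the behaviour a contrastive alignment objective rewards. Per the caption, the sequence-versus-joint-domain case is similar, so under the same assumptions sequence embeddings could be passed as the anchors instead.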