Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion

Dingyi Rong, Wenzhuo Zheng, Bozitao Zhong, Zhouhan Lin, Liang Hong, Ning Liu

TL;DR

MAPred tackles enzyme function prediction by integrating protein sequence with 3Di-based structural tokens. It employs a global-local feature extraction framework and an autoregressive predictor to output the four EC digits in a hierarchical order. Across multiple benchmarks, MAPred delivers state-of-the-art performance and demonstrates robustness through ablations, with attention analyses suggesting localization of catalytic regions. This work enables more reliable, granular enzyme annotations and provides interpretability by highlighting functional regions, with potential extension to GO terms in future work.

Abstract

Accurate prediction of enzyme function is crucial for elucidating biological mechanisms and driving innovation across various sectors. Existing deep learning methods tend to rely solely on either sequence data or structural data and predict the EC number as a whole, neglecting the intrinsic hierarchical structure of EC numbers. To address these limitations, we introduce MAPred, a novel multi-modality and multi-scale model designed to autoregressively predict the EC number of proteins. MAPred integrates both the primary amino acid sequence and the 3Di structural tokens of proteins, employing a dual-pathway approach to capture comprehensive protein characteristics and essential local functional sites. Additionally, MAPred utilizes an autoregressive prediction network to sequentially predict the digits of the EC number, leveraging the hierarchical organization of EC classifications. Evaluations on benchmark datasets, including New-392, Price, and New-815, demonstrate that our method outperforms existing models, marking a significant advance in the reliability and granularity of protein function prediction within bioinformatics.
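
To make the idea of digit-by-digit prediction concrete, the following is a minimal sketch of greedy autoregressive decoding over the four EC levels. It is not the authors' implementation: the names `decode_ec`, `toy_scorer`, and `TOY_SCORES` are hypothetical, the scores are made-up numbers, and a plain lookup table stands in for the prediction network that, in MAPred, would be conditioned on the fused protein features.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: given the digits chosen so far, return a score
# for each candidate next digit. A real model would also condition on the
# protein's features; this sketch omits them so that it runs on its own.
Scorer = Callable[[Tuple[str, ...]], Dict[str, float]]


def decode_ec(score_next: Scorer, levels: int = 4) -> str:
    """Greedily pick one EC digit per level, conditioning on the prefix."""
    prefix: List[str] = []
    for _ in range(levels):
        scores = score_next(tuple(prefix))
        if not scores:
            break  # no candidates for this prefix; stop early
        # Choose the highest-scoring digit given everything chosen so far.
        prefix.append(max(scores, key=scores.get))
    return ".".join(prefix)


# Toy score table (assumed values, for illustration only).
TOY_SCORES: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"1": 0.2, "2": 0.7, "3": 0.1},
    ("2",): {"1": 0.3, "7": 0.6},
    ("2", "7"): {"7": 0.9, "1": 0.1},
    ("2", "7", "7"): {"7": 0.8, "6": 0.2},
}


def toy_scorer(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return TOY_SCORES.get(prefix, {})


print(decode_ec(toy_scorer))  # prints 2.7.7.7
```

Because each level is scored given the digits already chosen, a candidate at the fourth level is only ever considered under the class, subclass, and sub-subclass selected before it. Greedy selection is an assumption of this sketch; the abstract does not specify the decoding strategy.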

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of MAPred. The inputs consist of the protein sequences and their corresponding 3Di tokens obtained through ProstT5. Within the Feature Extraction Network, we employ both a global feature extraction pathway and a local feature extraction pathway to capture the overall characteristics of the proteins and their specific functional sites, respectively. These features are then merged using a fuse block. In the Prediction Network, an autoregressive prediction architecture is utilized to predict the label for each digit in the EC number.
  • Figure 2: The training stage of our model.
  • Figure 3: Performance comparison of methods across different EC occurrence frequencies and hierarchical digit accuracy analysis. (A) Evaluation on the combined datasets, binned by the number of times each EC number appears in the training dataset. (B) Comparative analysis of the accuracy in predicting each digit of the EC number.
  • Figure 4: Amino acid residues highlighted by MAPred. Residues in blue indicate where the model pays more attention. (A) DNA polymerase IV dinB (UniProt ID: B8FBE8), with the catalytic center where polymerase reactions occur. (B) Glutamyl-tRNA amidotransferase gatA (UniProt ID: Q21RH9), with the reaction center and substrate glutamate in yellow. (C) Histone-lysine N-methyltransferase PRDM9 (UniProt ID: Q96EQ9), illustrating the reaction center where substrate lysine is modified.
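
The two analyses named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the function names, the bin edges `(0, 5, 10)`, the use of exact-match accuracy in panel (A), and the prefix-match definition of per-digit accuracy in panel (B) are all hypothetical choices, and the labels are toy data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def digit_accuracy(true_ecs: Sequence[str], pred_ecs: Sequence[str],
                   levels: int = 4) -> List[float]:
    """Fraction of proteins whose first k EC digits all match, k = 1..levels."""
    hits = [0] * levels
    for true_ec, pred_ec in zip(true_ecs, pred_ecs):
        t, p = true_ec.split("."), pred_ec.split(".")
        for k in range(levels):
            if t[: k + 1] == p[: k + 1]:
                hits[k] += 1
    n = max(len(true_ecs), 1)  # avoid division by zero on empty input
    return [h / n for h in hits]


def accuracy_by_frequency(train_ecs: Sequence[str], true_ecs: Sequence[str],
                          pred_ecs: Sequence[str],
                          edges: Sequence[int] = (0, 5, 10)) -> Dict[int, float]:
    """Exact-match accuracy grouped by how often the true EC occurs in training.

    `edges` must contain 0; each protein falls into the bin labelled by the
    largest edge that does not exceed its EC number's training count.
    """
    counts = Counter(train_ecs)
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for true_ec, pred_ec in zip(true_ecs, pred_ecs):
        edge = max(e for e in edges if e <= counts[true_ec])
        totals[edge] += 1
        correct[edge] += int(true_ec == pred_ec)
    return {e: correct[e] / totals[e] for e in sorted(totals)}


# Toy labels (assumed, for illustration only).
train = ["2.7.7.7"] * 6 + ["6.3.5.7"] * 2
true = ["2.7.7.7", "2.7.7.7", "6.3.5.7", "1.1.1.1"]
pred = ["2.7.7.7", "2.7.1.7", "6.3.5.7", "1.1.1.2"]

print(digit_accuracy(true, pred))                # [1.0, 1.0, 0.75, 0.5]
print(accuracy_by_frequency(train, true, pred))  # {0: 0.5, 5: 0.5}
```

Under the prefix-match definition, accuracy can only stay level or fall as more digits are required, which is why the deeper EC levels are the harder targets.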