RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models

Zhenkun Shi, Jun Zhu, Dehang Wang, BoYu Chen, Qianqian Yuan, Zhitao Mao, Fan Wei, Weining Wu, Xiaoping Liao, Hongwu Ma

Abstract

A key challenge in enzyme annotation is identifying the biochemical reactions catalyzed by proteins. Most existing methods rely on Enzyme Commission (EC) numbers as intermediaries: they first predict an EC number and then retrieve the associated reactions. This indirect strategy introduces ambiguity due to the complex many-to-many mappings among proteins, EC numbers, and reactions, and is further complicated by frequent updates to EC numbers and inconsistencies across databases. To address these challenges, we present RXNRECer, a transformer-based ensemble framework that directly predicts enzyme-catalyzed reactions without relying on EC numbers. It integrates protein language modeling and active learning to capture both high-level sequence semantics and fine-grained transformation patterns. Evaluations on curated cross-validation and temporal test sets demonstrate consistent improvements over six EC-based baselines, with gains of 16.54% in F1 score and 15.43% in accuracy. Beyond accuracy gains, the framework offers clear advantages for downstream applications, including scalable proteome-wide reaction annotation, enhanced specificity in refining generic reaction schemas, systematic annotation of previously uncurated proteins, and reliable identification of enzyme promiscuity. By incorporating large language models, it also provides interpretable rationales for predictions. These capabilities make RXNRECer a robust and versatile solution for EC-free, fine-grained enzyme function prediction, with potential uses across enzyme research and industrial settings.
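
The ambiguity of the EC-mediated route described above can be illustrated with a minimal sketch. All identifiers below (`prot_1`, `EC_A`, `rxn_1`, and so on) are made-up placeholders, not real proteins, EC numbers, or reactions, and the function name `reactions_via_ec` is hypothetical. The sketch only shows how chaining two many-to-many mappings can return more candidate reactions than a protein actually catalyzes; it is not the authors' implementation.

```python
# Toy mappings with made-up identifiers (assumptions for illustration only).
PROTEIN_TO_EC = {"prot_1": {"EC_A", "EC_B"}}      # one protein, several EC numbers
EC_TO_REACTIONS = {
    "EC_A": {"rxn_1", "rxn_2"},                    # one EC number, several reactions
    "EC_B": {"rxn_2", "rxn_3"},
}


def reactions_via_ec(protein, protein_to_ec, ec_to_reactions):
    """Indirect route: look up EC numbers first, then collect every linked reaction."""
    reactions = set()
    for ec in protein_to_ec.get(protein, set()):
        reactions |= ec_to_reactions.get(ec, set())
    return reactions


# Candidate reactions retrieved through the EC intermediary.
candidates = reactions_via_ec("prot_1", PROTEIN_TO_EC, EC_TO_REACTIONS)

# Suppose (hypothetically) the protein catalyzes only rxn_2.
true_reactions = {"rxn_2"}

# Reactions pulled in by the mapping but not catalyzed by the protein.
spurious = candidates - true_reactions
```

In this toy case the indirect route returns three candidates for a protein with one true reaction, which is the kind of over-assignment a direct reaction-level predictor is meant to avoid.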

Paper Structure (27 sections, 11 equations, 6 figures)

Figures (6)

  • Figure 1: Schematic overview of the RXNRECer framework and its active learning workflow. (I) RXNRECer Framework. (a) Sequence-based reaction prediction (RXNRECer-S1). Protein sequences are embedded using a pre-trained language model. The resulting multi-level embeddings are passed through GRU and Transformer layers, followed by a fully connected (FC) layer to produce reaction scores. This EC-independent prediction module is fine-tuned via a reaction-level classification task within an active learning framework. (b) Ensemble integration (RXNRECer-S2). Predictions from EC-based and PLM similarity-based methods are combined with the direct predictor using a voting scheme to improve robustness across diverse reaction types. (c) Interpretation and ranking (RXNRECer-S3). A general-purpose language model (GLM) is prompted to re-rank the candidate reactions and generate concise, biologically meaningful justifications. (II) Active Learning Strategy. (a) Initial model training. The model is initially trained on a small labeled set $D_{Li}$. (b) Validation. Performance is evaluated on a validation set $D_V$ sampled from the unlabeled pool $D_U$. (c–e) Sample selection. An acquisition function $S(x)$ identifies the most informative samples for labeling. (f–g) Fine-tuning. The selected samples are incorporated into the training set $D_L$ to refine the model. (h–i) Validation sampling. Additional samples are selected to monitor performance after each fine-tuning iteration.
  • Figure 2: Comparative evaluation of enzyme reaction prediction methods on two datasets. (I) Results on the 10-fold cross-validation dataset (ds_rcv). (II) Results on the independent temporal test dataset (ds_rcp). For each panel: (a) Performance of EC-based methods (ECRECer, CatFam, CLEAN, PRIAM, DeepEC, and MSA-via-EC). (b) Performance of PLM-based similarity methods (T5-cosine, T5-euclidean, ESM-cosine, ESM-euclidean, Unirep-cosine, Unirep-euclidean), the end-to-end fine-tuned RXNRECer-S1, and MSA-via-RXN. Results on ds_rcv are averaged over 10 folds with error bars showing standard deviations; results on ds_rcp are from a single evaluation without error bars.
  • Figure 3: Performance evaluation of ensemble strategies on ds_rcp from three perspectives: (a) macro-averaged Precision (mPrecision), (b) macro-averaged Recall (mRecall), and (c) run time with respect to the number of predictors. Boxplots summarize the distribution of results across different ensemble settings, with individual points representing Recall-boosted (red) and Majority-vote (blue) strategies, and orange dots indicating the average values. The results highlight the trade-offs between precision, recall, and computational cost when adopting different ensemble strategies.
  • Figure 4: Case study on large-scale reaction prediction for the Fusarium venenatum proteome (FS12832). (I) Coverage Comparison. Heatmap showing the proportion of FS12832 proteins assigned to different functional categories by various methods. (II) Structure Consistency Evaluation. (a) Kernel Density Estimation (KDE) plot comparing TM-score distributions between RXNRECer and CLEAN. (b) Bin-wise TM-score comparison and Wilcoxon test showing RXNRECer achieves higher structural consistency. (III) Non-Enzyme Misannotated as Enzyme. (a) TM-score heatmap showing low structural similarity between CLEAN-predicted proteins and canonical enzymes. (b) Structural alignment confirming CLEAN incorrectly assigned EC 2.4.99.19 to non-enzymes. (IV) RXNRECer vs. CLEAN. (a) TM-score heatmap comparing predictions from CLEAN and RXNRECer. (b) Structural alignment showing RXNRECer correctly identifies enzyme functions, while CLEAN fails.
  • Figure 5: (I) Representative cases illustrating RXNRECer’s capability to resolve general reaction schemas into substrate-specific enzymatic transformations. (a) refines the generic aldehyde oxidation (RHEA:16185) to a specific reaction acting on acetaldehyde (RHEA:25294); (b) resolves a primary amine oxidation template (RHEA:16153) into a specific transformation involving 2-phenylethylamine (RHEA:22265). (II) Representative cases illustrating RXNRECer’s capability to predict reaction-level functions in reviewed proteins lacking curated catalytic annotations, exemplified by P31852 (a) and P73408 (b).
  • ...and 1 more figure
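
The sample-selection step in the Figure 1 caption (panel II, c–g) can be sketched as follows. The caption does not say which acquisition function $S(x)$ is used, so this sketch assumes predictive entropy, a common uncertainty-based choice. The function names, sample ids, and probabilities are hypothetical and serve only to show how samples move from the unlabeled pool $D_U$ into the labeled set $D_L$.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs, k):
    """Return ids of the k pool samples with the highest acquisition score S(x).

    Assumption: S(x) is the predictive entropy of the model output for x.
    """
    ranked = sorted(pool_probs, key=lambda sid: entropy(pool_probs[sid]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs over three reaction classes for the unlabeled pool D_U.
pool_probs = {
    "seq_a": [0.98, 0.01, 0.01],   # confident prediction, low entropy
    "seq_b": [0.34, 0.33, 0.33],   # near-uniform, highest entropy
    "seq_c": [0.70, 0.20, 0.10],
    "seq_d": [0.50, 0.50, 0.00],
}

labeled = {"seq_0"}                          # current labeled set D_L
chosen = select_for_labeling(pool_probs, k=2)

labeled.update(chosen)                       # add the selected samples to D_L
for sample_id in chosen:
    del pool_probs[sample_id]                # and remove them from D_U
```

After this step the model would be fine-tuned on the enlarged $D_L$ and re-evaluated on the validation set, then the loop repeats. Other acquisition functions (for example, margin or least-confidence scores) would slot into `select_for_labeling` in the same way.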