ReactZyme: A Benchmark for Enzyme-Reaction Prediction

Chenqing Hua, Bozitao Zhong, Sitao Luan, Liang Hong, Guy Wolf, Doina Precup, Shuangjia Zheng

TL;DR

ReactZyme tackles enzyme function by predicting catalyzed reactions directly, reframing enzyme annotation as an enzyme-reaction retrieval problem. It builds the largest enzyme-reaction dataset to date from SwissProt and Rhea, and introduces a multi-view representation that combines reaction SMILES/graphs with structure-aware enzyme embeddings derived from protein language models and AlphaFold structures. Across time- and enzyme-similarity splits, 2D/3D graph representations and PLMs deliver strong retrieval performance, while the reaction-similarity split remains particularly challenging, highlighting a key area for methodological advances such as contrastive learning or more sophisticated decoders. The benchmark and baselines establish a foundation for future enzyme discovery and function annotation efforts, with public data and code enabling broader reuse and evaluation.

Abstract

Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. To address the complexity of this task, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme-reaction datasets, delivering a more refined view of enzyme functionality. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions for novel proteins, facilitating enzyme discovery and function annotation (https://github.com/WillHua127/ReactZyme).
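The retrieval framing in the abstract can be illustrated with a small sketch: given a score matrix between reactions and enzymes, enzymes are ranked per reaction and the ranking is checked against known enzyme-reaction pairs. The function names (`rank_enzymes`, `top_k_accuracy`), the toy scores, and the choice of top-k accuracy as the metric are assumptions made for illustration, not the benchmark's actual evaluation code.

```python
import numpy as np

def rank_enzymes(scores: np.ndarray) -> np.ndarray:
    """Sort enzyme indices from highest to lowest score for each reaction.

    scores has shape (n_reactions, n_enzymes); higher means "more likely
    to catalyze". This layout is an assumption of the sketch.
    """
    # argsort is ascending, so negate; a stable sort keeps ties in index order.
    return np.argsort(-scores, axis=1, kind="stable")

def top_k_accuracy(scores: np.ndarray, positives: list, k: int) -> float:
    """Fraction of reactions with at least one true enzyme in the top k."""
    top_k = rank_enzymes(scores)[:, :k]
    hits = [bool(set(top_k[i].tolist()) & positives[i])
            for i in range(len(positives))]
    return float(np.mean(hits))

# Toy data (made-up numbers): 3 reactions x 4 candidate enzymes.
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.7],
    [0.5, 0.6, 0.1, 0.3],
])
# Hypothetical ground truth: indices of enzymes known to catalyze each reaction.
positives = [{0}, {3}, {0, 2}]

ranking = rank_enzymes(scores)
top1 = top_k_accuracy(scores, positives, k=1)
top2 = top_k_accuracy(scores, positives, k=2)
print(ranking.tolist(), top1, top2)
```

Ranking reactions for a given enzyme is the same operation applied to the transposed score matrix.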

Paper Structure

This paper contains 23 sections, 7 equations, 2 figures, and 20 tables.

Figures (2)

  • Figure 1: Overview of the enzyme-reaction prediction task. (a) Illustration of the enzymatic reaction process: the substrate binds to the enzyme; the enzyme-substrate complex forms; the product is released, leaving the enzyme free for another catalytic cycle. (b) Current methods for enzyme reaction prediction: search for annotated enzymes (e.g., sequence-based BLAST (Altschul et al., 1990) or structure-based FoldSeek (van Kempen et al., 2024)); prediction of EC/GO annotations (e.g., CLEAN (Yu et al., 2023)); enzyme-reaction prediction (ReactZyme).
  • Figure 2: Overview of the method. For each reaction, molecular conformations are first computed to provide 3D structural information; for enzymes, AlphaFold is used to obtain structures. Molecule encoders then encode the 2D molecular graphs together with their 3D geometry, and protein language models initialize the enzyme features. The substrate and product representations are refined through cross-attention and then merged into a single reaction representation. Enzyme features are further refined with an equivariant GNN. The enzyme and reaction embeddings are then passed through an encoder-decoder to model pairwise relationships, and a probability matrix between enzymes and reactions is computed for retrieval.
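The data flow described in Figure 2 (cross-attention between substrates and products, merging into one reaction embedding, then a pairwise enzyme-reaction probability matrix) can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in, not the authors' architecture: the attention here has a single head and no learned weights, mean pooling and concatenation are assumed for the merge step, and a sigmoid over dot products replaces the encoder-decoder. All function names, dimensions, and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries: np.ndarray, keys_values: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention with a residual add.

    No learned projections are used; a trained model would include them.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return queries + attn @ keys_values

def reaction_embedding(substrate: np.ndarray, product: np.ndarray) -> np.ndarray:
    """Refine each side against the other, pool over atoms, and concatenate."""
    s = cross_attention(substrate, product)   # substrate attends to product
    p = cross_attention(product, substrate)   # product attends to substrate
    return np.concatenate([s.mean(axis=0), p.mean(axis=0)])

def probability_matrix(enzymes: np.ndarray, reactions: np.ndarray) -> np.ndarray:
    """Sigmoid of pairwise dot products, shape (n_enzymes, n_reactions)."""
    logits = enzymes @ reactions.T
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
d = 8  # hypothetical per-atom feature size
# Two toy reactions with different numbers of substrate and product atoms.
reactions = np.stack([
    reaction_embedding(rng.normal(size=(5, d)), rng.normal(size=(6, d))),
    reaction_embedding(rng.normal(size=(3, d)), rng.normal(size=(4, d))),
])
# Three toy enzyme embeddings, standing in for PLM + GNN outputs.
enzymes = 0.1 * rng.normal(size=(3, 2 * d))

P = probability_matrix(enzymes, reactions)
print(P.shape)
```

Each column of `P` can then be sorted to retrieve candidate enzymes for a reaction, and each row to retrieve candidate reactions for an enzyme.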