EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants

Anas Aziz Khan, Md Shah Fahad, Priyanka, Ramesh Chandra, Guransh Singh

TL;DR

EnzyCLIP introduces a CLIP-inspired cross-attention dual-encoder that jointly predicts the enzyme turnover number ($K_{\text{cat}}$) and the Michaelis constant ($K_m$) from protein sequences and substrate SMILES. It combines frozen ESM-2 protein embeddings with ChemBERTa chemical representations in a bidirectional cross-attention architecture, and it is trained with an InfoNCE contrastive loss plus a SmoothL1 regression objective on $\log_{10}$-transformed targets, so that the model learns aligned multimodal representations. On CatPred-DB data, it achieves competitive test $R^2$ values (~0.59–0.61) for both parameters, with $K_m$ prediction slightly outperforming $K_{\text{cat}}$ prediction and further gains from XGBoost ensembles on the learned embeddings. The results demonstrate the value of multimodal integration for enzyme kinetics, provide interpretable insights via SHAP analyses, and offer a lightweight, scalable framework suitable for enzyme engineering and high-throughput screening. Length- and EC-class-dependent performance patterns point to distinct mechanistic signals for catalysis and binding, and the work sets the stage for further enhancements through structure, dynamics, and phylogenetic information.

Abstract

Accurate prediction of enzyme kinetic parameters is crucial for drug discovery, metabolic engineering, and synthetic biology applications. Current computational approaches struggle to capture complex enzyme-substrate interactions and often focus on a single parameter, neglecting the joint prediction of catalytic turnover numbers ($K_{\text{cat}}$) and Michaelis-Menten constants ($K_m$). We present EnzyCLIP, a novel dual-encoder framework that leverages contrastive learning and cross-attention mechanisms to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. Our approach integrates ESM-2 protein language model embeddings with ChemBERTa chemical representations through a CLIP-inspired architecture enhanced with bidirectional cross-attention for dynamic enzyme-substrate interaction modeling. EnzyCLIP combines an InfoNCE contrastive loss with a Huber (SmoothL1) regression loss to learn aligned multimodal representations while predicting $\log_{10}$-transformed kinetic parameters. Trained on CatPred-DB, which contains 23,151 $K_{\text{cat}}$ and 41,174 $K_m$ experimentally validated measurements, the model achieves competitive performance, with $R^2$ scores of 0.593 for $K_{\text{cat}}$ and 0.607 for $K_m$ prediction. XGBoost ensemble methods applied to the learned embeddings further improved $K_m$ prediction ($R^2 = 0.61$) while maintaining robust $K_{\text{cat}}$ performance.
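
The training objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature of 0.07, the symmetric (enzyme-to-substrate and substrate-to-enzyme) form of InfoNCE, and the additive `contrastive_weight` are all assumptions made for illustration. The regression term uses SmoothL1 with `beta=1.0`, which coincides with the Huber loss at `delta=1.0`.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(enzyme_emb, substrate_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of each matrix is treated as a positive pair.

    The temperature value is an assumption, not taken from the paper.
    """
    e = enzyme_emb / np.linalg.norm(enzyme_emb, axis=1, keepdims=True)
    s = substrate_emb / np.linalg.norm(substrate_emb, axis=1, keepdims=True)
    logits = e @ s.T / temperature  # cosine similarities, shape (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_e2s = -log_softmax(logits, axis=1)[idx, idx].mean()  # enzyme -> substrate
    loss_s2e = -log_softmax(logits, axis=0)[idx, idx].mean()  # substrate -> enzyme
    return 0.5 * (loss_e2s + loss_s2e)


def smooth_l1(pred, target, beta=1.0):
    """SmoothL1: quadratic for |error| < beta, linear beyond it."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()


def combined_loss(enzyme_emb, substrate_emb, pred_log10, measured,
                  contrastive_weight=1.0):
    """Regression on log10 targets plus a weighted contrastive term.

    `contrastive_weight` is a hypothetical hyperparameter.
    """
    target = np.log10(measured)  # kinetic constants span many orders of magnitude
    return (smooth_l1(pred_log10, target)
            + contrastive_weight * info_nce(enzyme_emb, substrate_emb))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enzymes = rng.normal(size=(8, 16))      # stand-in for projected ESM-2 features
    substrates = rng.normal(size=(8, 16))   # stand-in for projected ChemBERTa features
    measured = 10.0 ** rng.uniform(-3, 3, size=8)  # synthetic kinetic values
    predictions = rng.normal(size=8)        # stand-in for the regression head output
    print(combined_loss(enzymes, substrates, predictions, measured))
```

Regressing on $\log_{10}$ targets keeps the loss from being dominated by the largest measurements, and the linear tail of SmoothL1 limits the influence of outlying values.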

Paper Structure

This paper contains 25 sections, 27 equations, 24 figures, 2 tables.

Figures (24)

  • Figure 1: Comprehensive $K_{\text{cat}}$ dataset analysis. Distribution of enzyme sequence lengths showing a mean of 430 and a median of 377 amino acids (top left), sample distribution across EC classes with EC 3 most represented (top right), $\log_{10}(K_{\text{cat}})$ value distribution with mean 0.96 and range $-6$ to $6$ (bottom left), weak correlation between sequence length and $K_{\text{cat}}$ ($r=0.027$, bottom center), and substrate SMILES length distribution (bottom right). The summary statistics table shows 23,151 total samples from 7,177 unique enzymes.
  • Figure 2: Dataset distribution comparisons for $K_{\text{cat}}$. Three-panel visualization showing (left) an approximately normal $\log_{10}(K_{\text{cat}})$ distribution, (center) protein sequence length distribution peaking at 300--500 amino acids, and (right) SMILES length distribution heavily right-skewed, with most substrates below 100 characters and a long tail toward longer strings.
  • Figure 3: Training dynamics for $K_{\text{cat}}$ prediction. (Top left) Training loss decreasing from 0.78 to 0.16 over 25 epochs. (Top right) Validation $R^2$ peaking at 0.5729 at epoch 10. (Bottom left) Validation RMSE reaching a minimum of 1.071. (Bottom right) Validation MAE achieving its best value of 0.737 at epoch 10.
  • Figure 4: EnzyCLIP $K_{\text{cat}}$ prediction performance across sequence lengths. (Top left) Box plots showing $K_{\text{cat}}$ distribution across six sequence length bins with medians around 1.0. (Top right) Mean $K_{\text{cat}}$ comparison between true and predicted values across length ranges. (Bottom left) RMSE values ranging from 0.987 to 1.192 across length bins. (Bottom right) $R^2$ scores showing optimal performance for 0--200 (0.649) and 200--400 (0.641) ranges, with degraded performance for 800--1000 (0.354) and 1000+ (0.441) amino acids.
  • Figure 5: EC class-based performance analysis for $K_{\text{cat}}$ prediction. (Top left) $R^2$ scores by EC class with EC 5 highest at 0.652. (Top right) RMSE by EC class ranging from 1.017 to 1.161. (Bottom left) MAE by EC class spanning 0.685 to 0.857. (Bottom right) Pearson correlation coefficients showing EC 5 highest at 0.818.
  • ...and 19 more figures