EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants
Anas Aziz Khan, Md Shah Fahad, Priyanka, Ramesh Chandra, Guransh Singh
TL;DR
EnzyCLIP introduces a CLIP-inspired cross-attention dual-encoder that jointly predicts the enzyme turnover number ($K_{cat}$) and the Michaelis constant ($K_m$) from protein sequences and substrate SMILES. By combining frozen ESM-2 protein embeddings with ChemBERTa chemical representations in a bidirectional cross-attention architecture, and training with an InfoNCE contrastive loss plus a SmoothL1 regression objective on $\log_{10}$-transformed targets, the model learns aligned multimodal representations. On CatPred-DB data, it achieves competitive test $R^2$ values (~0.59–0.61) for both parameters, with $K_m$ predicted slightly better than $K_{cat}$ and further gains for $K_m$ from XGBoost ensembles on the learned embeddings. The results demonstrate the value of multimodal integration for enzyme kinetics, provide interpretable insights via SHAP analyses, and offer a lightweight, scalable framework suitable for enzyme engineering and high-throughput screening. Length- and EC-class-dependent performance patterns point to distinct mechanistic signals for catalysis and binding, and the work sets the stage for further enhancements through structure, dynamics, and phylogenetic information.
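To make the training objective concrete, the following is a minimal sketch of a symmetric InfoNCE term combined with a SmoothL1 regression term on $\log_{10}$ targets. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the weight `contrastive_weight`, and the simple weighted sum are all hypothetical choices. With `beta = 1`, SmoothL1 coincides with the Huber loss with threshold 1, which is consistent with the abstract's wording.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(enzyme_emb, substrate_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is a matched pair.

    The temperature value is an assumed CLIP-style default, not from the paper.
    """
    e = [l2_normalize(v) for v in enzyme_emb]
    s = [l2_normalize(v) for v in substrate_emb]
    n = len(e)
    # logits[i][j]: cosine similarity of enzyme i and substrate j, over temperature
    logits = [
        [sum(a * b for a, b in zip(e[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the enzyme-to-substrate and substrate-to-enzyme directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def smooth_l1(pred, target, beta=1.0):
    """Mean SmoothL1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def combined_loss(enzyme_emb, substrate_emb, pred_log10, target_log10,
                  contrastive_weight=0.5):
    """Regression loss plus weighted contrastive loss (the weighting is assumed)."""
    regression = smooth_l1(pred_log10, target_log10)
    contrastive = info_nce(enzyme_emb, substrate_emb)
    return regression + contrastive_weight * contrastive
```

In this sketch the contrastive term is small when each enzyme embedding is most similar to its own substrate embedding within the batch, while the regression term penalizes errors in the predicted $\log_{10}$ kinetic constants and grows only linearly for large errors.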
Abstract
Accurate prediction of enzyme kinetic parameters is crucial for drug discovery, metabolic engineering, and synthetic biology applications. Current computational approaches face limitations in capturing complex enzyme-substrate interactions and often focus on single parameters while neglecting the joint prediction of catalytic turnover numbers ($K_{cat}$) and Michaelis-Menten constants ($K_m$). We present EnzyCLIP, a novel dual-encoder framework that leverages contrastive learning and cross-attention mechanisms to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. Our approach integrates ESM-2 protein language model embeddings with ChemBERTa chemical representations through a CLIP-inspired architecture enhanced with bidirectional cross-attention for dynamic enzyme-substrate interaction modeling. EnzyCLIP combines an InfoNCE contrastive loss with a Huber (SmoothL1) regression loss to learn aligned multimodal representations while predicting $\log_{10}$-transformed kinetic parameters. The model was trained on the CatPred-DB database, which contains 23,151 $K_{cat}$ and 41,174 $K_m$ experimentally validated measurements, and achieved competitive performance, with $R^2$ scores of 0.593 for $K_{cat}$ and 0.607 for $K_m$ prediction. XGBoost ensemble methods applied to the learned embeddings further improved $K_m$ prediction ($R^2$ = 0.61) while maintaining robust $K_{cat}$ performance.
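As a reading aid for the reported scores, the sketch below shows how a coefficient of determination can be computed on $\log_{10}$-transformed targets. It assumes that $R^2$ is evaluated in log space, which the text above suggests but does not state explicitly. The function names and the numeric values are hypothetical and are not taken from CatPred-DB.

```python
import math


def log10_transform(values):
    """Map strictly positive kinetic constants to log10 space."""
    return [math.log10(v) for v in values]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured values spanning several orders of magnitude.
measured = [0.1, 1.0, 10.0, 100.0]
targets = log10_transform(measured)       # approximately [-1, 0, 1, 2]

# Hypothetical model outputs, already in log10 space.
predictions = [-0.5, 0.0, 1.0, 1.5]

score = r2_score(targets, predictions)    # about 0.9 for these toy numbers
```

Working in log space keeps measurements that span orders of magnitude on a comparable scale, so a few very large values do not dominate either the regression loss or the $R^2$ score.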
