Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries
Chunbin Gu, Mutian He, Hanqun Cao, Guangyong Chen, Chang-yu Hsieh, Pheng Ann Heng
TL;DR
DEL screening is noisy due to nonspecific interactions, and existing encoders struggle with limited chemical diversity and single-scale representations. The authors present MPDF, a Multimodal Pretraining DEL-Fusion model that combines contrastive pretraining across Graph/Text and ECFP/Text representations with a DEL-Fusion network that fuses atomic, submolecular, and molecular information via bilinear interactions. They pretrain on large biochemical datasets to improve generic feature extraction and define an enrichment parameter $R$ reflecting compound activity, optimized through a Bayesian-inspired loss on $P(R|z)$. Evaluated on three noisy DEL datasets (P, A, OA), MPDF outperforms prior methods in denoising and activity prediction, demonstrating enhanced identification of high-affinity molecules and potential improvements to DEL utility in drug discovery.
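As a rough illustration of what fusing features "via bilinear interactions" can mean, the sketch below combines two feature vectors from different scales through a set of weight matrices, producing one fused value per matrix. It is a minimal toy under stated assumptions, not the authors' DEL-Fusion network: the function name `bilinear_fuse`, the feature dimensions, and the weight values are all hypothetical, and the summary above does not specify how MPDF parameterizes its fusion.

```python
def bilinear_fuse(x, y, weights):
    """Return one fused feature per weight matrix: out[k] = x^T W_k y.

    x, y     : feature vectors from two different scales (lists of floats)
    weights  : list of matrices, each of shape len(x) by len(y)
    """
    fused = []
    for w_k in weights:
        total = 0.0
        for i, x_i in enumerate(x):
            for j, y_j in enumerate(y):
                # Every pair (x_i, y_j) interacts multiplicatively.
                total += x_i * w_k[i][j] * y_j
        fused.append(total)
    return fused


# Hypothetical toy inputs: a 2-d "atomic-level" vector and a
# 3-d "molecular-level" vector. In a trained model the weights are learned.
atom_level = [1.0, 2.0]
molecule_level = [0.5, -1.0, 3.0]
weights = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
]

fused = bilinear_fuse(atom_level, molecule_level, weights)
print(fused)
```

Unlike concatenation, a bilinear form lets every component of one representation modulate every component of the other, which is one common motivation for using it when combining features from different encoders.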
Abstract
In drug discovery, DNA-encoded library (DEL) screening has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL data have been employed to extract compound features, aiming to denoise the data and uncover potential binders to the desired therapeutic target. Nevertheless, the inherent structure of DELs, constrained by the limited diversity of building blocks, limits the performance of compound encoders. Moreover, existing methods capture compound features at only a single level, further limiting the effectiveness of the denoising strategy. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capabilities through pretraining and integrates compound features across multiple scales. We develop pretraining tasks that apply contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders' ability to acquire generic features. Furthermore, we propose a novel DEL-Fusion framework that combines compound information at the atomic, submolecular, and molecular levels, as captured by different compound encoders. Together, these innovations equip MPDF with enriched, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF outperforms existing methods in data processing and analysis on validation tasks. Notably, MPDF offers novel insights into identifying high-affinity molecules, paving the way for improved DEL utility in drug discovery.
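To make the idea of a contrastive objective between compound representations and text descriptions concrete, here is a minimal sketch of a symmetric InfoNCE-style loss, a common choice for this kind of paired pretraining. This is an assumption for illustration only: the abstract does not state which contrastive loss MPDF uses, and the names `contrastive_loss` and `temperature`, along with the toy embeddings, are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(r - m) for r in row))
        loss += log_sum - row[i]
    return loss / len(logits)


def contrastive_loss(compound_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of (compound, text) pairs.

    Row i of compound_emb and row i of text_emb are treated as a matching
    pair; all other rows in the batch act as negatives.
    """
    c = [_normalize(v) for v in compound_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(ci, tj)) / temperature for tj in t]
        for ci in c
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-compound view
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Hypothetical toy embeddings: aligned pairs versus swapped pairs.
compounds = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(compounds, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(compounds, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

The loss is small when each compound embedding is closest to its own text embedding and large when the pairing is wrong, which is the pressure that pushes an encoder toward features shared across modalities.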
