Deep peak property learning for efficient chiral molecules ECD spectra prediction

Hao Li, Da Long, Li Yuan, Yonghong Tian, Xinchang Wang, Fanyang Mo

TL;DR

This work tackles the costly prediction of electronic circular dichroism spectra for chiral molecules by introducing CMCDS, a large-scale dataset of computed ECD spectra for 22,190 molecules, and ECDFormer, a Transformer-based model that predicts peak properties (number, position, symbol) from a GeoGNN-derived molecular representation and renders the full ECD spectrum from those peaks. The peak-focused approach, with a dedicated loss and peak-specific metrics, yields superior accuracy on peak-number, peak-position, and peak-symbol predictions compared with traditional machine-learning and deep-learning baselines, while dramatically accelerating spectrum generation. The method enables rapid chiral-molecule assignation and high-throughput screening, with potential impact on asymmetric synthesis and pharmaceutical development; limitations include bypassing conformational searches and focusing on single-chiral-center molecules, suggesting directions to handle conformational ensembles and multi-center chirality in future work.

Abstract

Chiral molecule assignation is crucial for asymmetric catalysis, functional materials, and the drug industry. The conventional approach requires theoretical calculations of electronic circular dichroism (ECD) spectra, which are time-consuming and costly. To speed up this process, we apply deep learning techniques to ECD prediction. We first build a large-scale dataset of Chiral Molecular ECD spectra (CMCDS) with calculated ECD spectra. We then develop ECDFormer, a Transformer-based model that learns chiral molecular representations and predicts the corresponding ECD spectra with improved efficiency and accuracy. Unlike other models for spectrum prediction, ECDFormer focuses on peak properties rather than the whole spectrum sequence, inspired by the scenario of chiral molecule assignation. Specifically, ECDFormer predicts the peak properties, including number, position, and symbol, and then renders the ECD spectra from these peak properties. This approach significantly outperforms other models in ECD prediction. ECDFormer reduces the time of acquiring ECD spectra from 1-100 hours per molecule to 1.5 s.
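The idea of rendering a spectrum from peak properties can be illustrated with a small sketch. This summary does not describe how ECDFormer actually renders spectra, so everything below is an assumption made for illustration: each peak is drawn as a Gaussian band of fixed width, the peak symbol is treated as a sign of +1 or -1, and the names render_ecd, peaks, and width are hypothetical.

```python
import math


def render_ecd(peaks, wavelengths, width=10.0):
    """Render a spectrum as a sum of signed Gaussian bands.

    peaks: list of (position_nm, sign) pairs, where sign is +1 or -1.
    wavelengths: wavelengths (nm) at which to evaluate the curve.
    width: Gaussian standard deviation in nm (illustrative assumption,
           not a value taken from the paper).
    """
    spectrum = []
    for wl in wavelengths:
        value = sum(
            sign * math.exp(-((wl - pos) ** 2) / (2.0 * width ** 2))
            for pos, sign in peaks
        )
        spectrum.append(value)
    return spectrum


# Hypothetical predicted peaks: positive near 220 nm, negative near 260 nm.
peaks = [(220.0, +1), (260.0, -1)]
wavelengths = [float(w) for w in range(180, 321)]
spectrum = render_ecd(peaks, wavelengths)

# Flipping every sign gives the mirror-image curve, as expected for the
# opposite enantiomer.
mirrored = render_ecd([(pos, -sign) for pos, sign in peaks], wavelengths)
```

In this toy form, the peak number is simply the length of the peaks list, so a model that outputs only a handful of values per molecule is enough to draw a full curve for comparison against an experimental spectrum.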

Paper Structure (23 sections, 6 equations, 8 figures, 2 tables)

This paper contains 23 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: The scheme for ECD prediction and chiral molecule assignation. a Thalidomide has two configurations (R/S). R-Thalidomide induces sedative effects, whereas S-Thalidomide is associated with teratogenic effects. b ECD comparison is the method most frequently employed for assigning the absolute configuration. However, the theoretical calculation of ECD is time-consuming, involving steps such as conformational searching, conformational optimization, excited-state property calculation, and Boltzmann weighting, so we employ deep learning for acceleration. c As molecules become more complex, the computation time increases. The CPU used is an Intel Xeon E5-2640 v4 @ 2.40 GHz.
  • Figure 2: The generation pipeline for our chiral molecular CD spectra dataset (CMCDS) for the ECD prediction task.
  • Figure 3: The General Pipeline of our ECDFormer model. The design of the peak property learning and prediction modules is inspired by the chemical chiral assignation procedure. By predicting peak properties and rendering ECD spectra, ECDFormer outperforms baselines in the ECD spectra prediction task.
  • Figure 4: The performance comparison between ECDFormer and baselines for ECD prediction. a The data distribution plot for the ground-truth and predicted peak numbers. b The violin plot of the discrepancies in peak positions between ground-truth ECD and predicted ECD from ECDFormer and baselines. $N_{v}$ is the peak number, representing the difficulty of cases. c The violin plot of the discrepancies in peak symbols between ground-truth ECD and predicted ECD from ECDFormer and baselines.
  • Figure 5: Visualization of ECD spectra predictions from ECDFormer. We visualize the ground-truth spectra and ECDFormer's predicted spectra for selected molecules from the test split of the CMCDS dataset.
  • ...and 3 more figures