De Novo Generation of Hit-like Molecules from Gene Expression Profiles via Deep Learning

Chen Li, Yoshihiro Yamanishi

TL;DR

This work tackles de novo hit-like molecule design by conditioning chemical generation on gene expression responses. It presents HVL2Mol, a two-stage framework that uses a VAE to extract latent biological features from expression profiles and a conditional LSTM to generate SMILES strings guided by those features. The approach yields valid, unique, and novel molecules with preserved drug-likeness (QED) and synthesizability (SA), and case studies show therapeutic relevance against gastric cancer, dermatitis, and Alzheimer's disease through disease-reversal profiles. The method advances omics-guided drug design by directly translating cellular response signals into chemically plausible candidates, with potential to accelerate lead discovery.
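
The validity, uniqueness, and novelty claims above can be made concrete with a small sketch. The definitions below follow a common convention for generative chemistry benchmarks and are an assumption, since the summary does not state the paper's exact formulas. `toy_is_valid` is a hypothetical stand-in that only checks parenthesis balance. A real pipeline would use a cheminformatics toolkit for validity and would compare canonical SMILES rather than raw strings.

```python
from typing import Callable, Dict, Iterable


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical stand-in for a real validity check.

    Only requires a non-empty string with balanced parentheses;
    it does NOT verify chemistry.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool] = toy_is_valid,
) -> Dict[str, float]:
    """Assumed definitions (not taken from the paper):

    validity   = valid / generated
    uniqueness = distinct valid / valid
    novelty    = distinct valid not in training / distinct valid
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    sample = ["CCO", "CCO", "CC(C", "c1ccccc1"]
    print(generation_metrics(sample, training={"CCO"}))
```

Passing the validity check as a parameter keeps the metric logic independent of the toolkit used to parse molecules.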

Abstract

De novo generation of hit-like molecules is a challenging task in the drug discovery process. Most methods in previous studies learn the semantics and syntax of molecular structures by analyzing molecular graphs or simplified molecular input line entry system (SMILES) strings; however, they do not take into account the drug responses of biological systems consisting of genes and proteins. In this study, we propose a hybrid neural network, HNN2Mol, which utilizes gene expression profiles to generate molecular structures with desirable phenotypes for arbitrary target proteins. In the algorithm, a variational autoencoder (VAE) is employed as a feature extractor to learn the latent feature distribution of the gene expression profiles. Then, a long short-term memory (LSTM) network is leveraged as the chemical generator to produce syntactically valid SMILES strings that satisfy the feature conditions of the gene expression profile extracted by the feature extractor. Experimental results and case studies demonstrate that the proposed HNN2Mol model can produce new molecules with potential bioactivities and drug-like properties.
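
To illustrate the feature-extractor step, here is a minimal sketch of the two ingredients every Gaussian VAE relies on: the reparameterization trick for sampling a latent vector and the KL term that regularizes the latent distribution. The mean-squared-error reconstruction term and the `beta` weight are assumptions for illustration; the abstract does not specify the loss actually used. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def reparameterize(
    mu: Sequence[float], log_var: Sequence[float], rng: random.Random
) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_divergence(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def reconstruction_mse(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between a profile and its reconstruction (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
    beta: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Reconstruction term plus beta-weighted KL regularizer."""
    return reconstruction_mse(x, x_hat) + beta * kl_divergence(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [0.0, -1.0]
    z = reparameterize(mu, log_var, rng)
    print("latent sample:", z)
    print("loss:", vae_loss([1.0, 2.0], [0.9, 2.1], mu, log_var))
```

In the framework described above, the encoder would output `mu` and `log_var` for each gene expression profile, and the sampled latent vector would then serve as the condition for the SMILES generator.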

Paper Structure (19 sections, 5 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Architecture of the proposed HVL2Mol model. (A) A VAE is trained to extract the biological features of gene expression profiles. Here, a VAE encoder attempts to extract the latent feature vector of a gene expression profile, and a VAE decoder attempts to reconstruct the gene expression profile from the latent vector. (B) After the VAE training, the latent vector is utilized as a condition to an LSTM to generate SMILES strings. An extracted latent vector and a vector representation of a start token are concatenated to generate the first atom of a SMILES string. Then, the generated atom and the condition generate the next atom iteratively. This iterative process ends when the defined end token (i.e., <EOS>) is generated. Finally, all atoms are combined to form a SMILES string. The newly generated SMILES string can be used as a candidate molecule for hit identification to treat diseases.
  • Figure 2: Distribution of fold change values in the gene expression profile of the molecule "C17H25ClN2O3" in the MCF7 cell line. The original gene expression profile of "C17H25ClN2O3" (green) and the reconstructed gene expression profile (red) have similar distributions.
  • Figure 3: Distribution of fold change values in the average gene expression profile of all molecules in the MCF7 cell line. The original gene expression profiles of the training set (green) and the reconstructed gene expression profiles (red) have similar distributions.
  • Figure 4: Training loss and the ratio of valid molecules generated by the proposed HVL2Mol. The red curve shows the training loss of the LSTM over the training epochs. The green curve shows the ratio of valid molecules generated by the LSTM over the training epochs. Note that molecule validity is checked with the RDKit tool.
  • Figure 5: Violin plots of QED scores for molecules from the training dataset and proposed HVL2Mol.
  • ...and 7 more figures
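
The iterative generation described in the Figure 1(B) caption can be sketched as a plain decoding loop: start from a start token, feed the previous token together with the latent condition into one generator step, and stop at `<EOS>`. The sketch below is an assumption-laden illustration, not the authors' implementation. `toy_step` is a hypothetical, deterministic stand-in for a trained LSTM step, and the `<SOS>` token name and `max_len` cap are invented here.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

START = "<SOS>"  # assumed name for the start token
EOS = "<EOS>"    # end token named in the Figure 1 caption

# A step function maps (previous token, condition, recurrent state)
# to (next token, new recurrent state).
StepFn = Callable[[str, Sequence[float], Optional[Any]], Tuple[str, Any]]


def generate_smiles(step_fn: StepFn, condition: Sequence[float], max_len: int = 100) -> str:
    """Autoregressively emit tokens until <EOS> or max_len tokens are produced."""
    tokens: List[str] = []
    prev, state = START, None
    for _ in range(max_len):
        token, state = step_fn(prev, condition, state)
        if token == EOS:
            break
        tokens.append(token)
        prev = token
    return "".join(tokens)


def toy_step(prev_token: str, condition: Sequence[float], state: Optional[int]) -> Tuple[str, int]:
    """Hypothetical stand-in for one LSTM step.

    Replays a fixed token script chosen by the sign of the first
    condition component; a real model would sample from a learned
    distribution over tokens instead.
    """
    step = 0 if state is None else state
    script = ["C", "C", "O"] if condition[0] >= 0 else ["C", "N"]
    token = script[step] if step < len(script) else EOS
    return token, step + 1


if __name__ == "__main__":
    print(generate_smiles(toy_step, [0.5]))
    print(generate_smiles(toy_step, [-1.0]))
```

Separating the loop from the step function means the same loop would work with a trained network, provided it exposes a step with this (hypothetical) signature.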