ChemFixer: Correcting Invalid Molecules to Unlock Previously Unseen Chemical Space

Jun-Hyoung Park, Ho-Jun Song, Seong-Whan Lee

TL;DR

ChemFixer tackles the problem of invalid SMILES strings produced by deep learning molecular generators. It is a transformer-based framework that is first pre-trained with masking and then fine-tuned on a large-scale dataset of valid/invalid molecule pairs derived from MOSES, so that it corrects invalid outputs while preserving the original chemical distribution and thereby expands the accessible chemical space. Across multiple generative models and a drug-target interaction (DTI) task (Co-VAE on KIBA), ChemFixer substantially improves validity (e.g., >30% for downstream ligands) and recovers promising candidate molecules, while maintaining distributional similarity as measured by Fréchet ChemNet Distance (FCD) and similarity to nearest neighbor (SNN). The work reports strong generalization, data-efficiency gains from masked pre-training, and practical relevance for drug discovery; future directions include 3D conformer validation and broader downstream applications.

Abstract

Deep learning-based molecular generation models have shown great potential in efficiently exploring vast chemical spaces by generating potential drug candidates with desired properties. However, these models often produce chemically invalid molecules, which limits the usable scope of the learned chemical space and poses significant challenges for practical applications. To address this issue, we propose ChemFixer, a framework designed to correct invalid molecules into valid ones. ChemFixer is built on a transformer architecture, pre-trained using masking techniques, and fine-tuned on a large-scale dataset of valid/invalid molecular pairs that we constructed. Through comprehensive evaluations across diverse generative models, ChemFixer improved molecular validity while effectively preserving the chemical and biological distributional properties of the original outputs. This indicates that ChemFixer can recover molecules that could not be previously generated, thereby expanding the diversity of potential drug candidates. Furthermore, ChemFixer was effectively applied to a drug-target interaction (DTI) prediction task using limited data, improving the validity of generated ligands and discovering promising ligand-protein pairs. These results suggest that ChemFixer is not only effective in data-limited scenarios, but also extensible to a wide range of downstream tasks. Taken together, ChemFixer shows promise as a practical tool for various stages of deep learning-based drug discovery, enhancing molecular validity and expanding accessible chemical space.
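To make the notion of an "invalid" SMILES string concrete, the following is a minimal sketch of a purely syntactic screen. It is not the validity criterion used in the paper, which is not specified in this summary; the function name `has_basic_smiles_syntax` is hypothetical. It only checks that branch parentheses are balanced, that bracket atoms are closed, and that every ring-closure label appears in pairs. A real pipeline would use a cheminformatics toolkit, which also checks valence and aromaticity.

```python
def has_basic_smiles_syntax(smiles: str) -> bool:
    """Cheap syntactic screen (assumption: illustrative only, not full validity).

    Checks balanced branches "(...)", closed bracket atoms "[...]",
    and ring-closure labels that open and close in pairs.
    """
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring-closure labels that are opened but not yet closed
    in_bracket = False   # True while inside a bracket atom such as [NH4+]
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            pass  # digits inside brackets are isotopes, H counts, or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch in "0123456789":
            open_rings ^= {int(ch)}      # toggle: first use opens, second closes
        elif ch == "%":
            label = smiles[i + 1:i + 3]  # two-digit ring label, e.g. %12
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {int(label)}
            i += 2
        i += 1
    return depth == 0 and not open_rings and not in_bracket


if __name__ == "__main__":
    for s in ["c1ccccc1", "c1ccccc", "CC(C)C", "CC(C"]:
        print(s, has_basic_smiles_syntax(s))
```

Strings that fail this screen (an unclosed ring or branch, for example) are typical of the syntactic errors an autoregressive generator can produce. Strings that pass may still be chemically invalid, which is why this sketch is only a first filter.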

Paper Structure

This paper contains 25 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: In SMILES strings, uppercase and lowercase letters, as well as branch symbols, differentiate the molecules. Uppercase C represents a non-aromatic carbon, while lowercase c represents an aromatic carbon, differentiating (a) and (b). The branch symbol indicates the position of the side chain, differentiating (a) and (c).
  • Figure 2: Illustrations of (a) the base molecular generative model framework, (b) the process of autoregressively generating SMILES strings, and (c) valid/invalid molecular pairs generated during the training of (a).
  • Figure 3: Illustration of ChemFixer: pre-training with a 10% masking ratio followed by fine-tuning using valid/invalid molecular pairs.
  • Figure 4: Distributional comparison among 5,000 molecules each from the MOSES test data, model-generated data, and ChemFixer-corrected data. Distributions were visualized using PCA and KDE along the first principal component.
  • Figure 5: Example of a ChemFixer-corrected molecule with a different pharmacology but the same Bemis–Murcko scaffold.
  • ...and 2 more figures
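
The 10% masking ratio mentioned in the Figure 3 caption can be illustrated with a small sketch. The caption gives only the ratio, so everything else here is an assumption: the regex tokenizer, the `<mask>` token, uniform random selection of positions, and the helper names `tokenize` and `mask_tokens` are all hypothetical and may differ from the authors' implementation.

```python
import random
import re

MASK = "<mask>"  # assumed mask token; the paper's actual token is not given here

# Assumed tokenizer: bracket atoms, two-letter halogens, two-digit ring labels,
# then any single character.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Cl|Br|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens (illustrative scheme only)."""
    return _TOKEN_RE.findall(smiles)


def mask_tokens(tokens, ratio=0.10, seed=None):
    """Replace roughly `ratio` of the tokens with MASK at random positions.

    Returns the masked token list and the sorted masked positions.
    At least one token is masked for any non-empty input.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


if __name__ == "__main__":
    toks = tokenize("CC(Cl)c1ccccc1")
    masked, positions = mask_tokens(toks, ratio=0.10, seed=0)
    print(toks)
    print(masked, positions)
```

During pre-training, a model would be asked to reconstruct the original tokens at the masked positions. The later fine-tuning stage on valid/invalid pairs is a separate step and is not shown.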