Symmetry-Constrained Generation of Diverse Low-Bandgap Molecules with Monte Carlo Tree Search

Akshay Subramanian, James Damewood, Juno Nam, Kevin P. Greenman, Avni P. Singhal, Rafael Gómez-Bombarelli

TL;DR

This work tackles the challenge of designing low-bandgap, NIR-sensitive organic molecules with practical synthetic accessibility. It introduces a symmetry-aware fragment decomposition pipeline derived from patent-mined data and a fragment-constrained Monte Carlo Tree Search (MCTS) generator that preserves reactive-position symmetry. A Chemprop reward predictor trained with TD-DFT labels, augmented by active learning, guides the search toward molecules with lower $E_g$ while maintaining chemical diversity. TD-DFT validation shows red-shifted absorption for generated candidates, demonstrating the approach's ability to produce diverse, synthesizable designs with potential impact for organic electronics and NIR applications; data and code are publicly available for reuse and further development.

Abstract

Organic optoelectronic materials are a promising avenue for next-generation electronic devices due to their solution processability, mechanical flexibility, and tunable electronic properties. In particular, near-infrared (NIR) sensitive molecules have unique applications in night-vision equipment and biomedical imaging. Molecular engineering has played a crucial role in developing non-fullerene acceptors (NFAs) such as the Y-series molecules, which have significantly improved the power conversion efficiency (PCE) of solar cells and enhanced spectral coverage in the NIR region. However, systematically designing molecules with targeted optoelectronic properties while ensuring synthetic accessibility remains a challenge. To address this, we leverage structural priors from domain-focused, patent-mined datasets of organic electronic molecules using a symmetry-aware fragment decomposition algorithm and a fragment-constrained Monte Carlo Tree Search (MCTS) generator. Our approach generates candidates that retain symmetry constraints from the patent dataset, while also exhibiting red-shifted absorption, as validated by TD-DFT calculations.

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures.

Figures (8)

  • Figure 1: Fragments to initialize MCTS. (a) Editable Y6 core with marked positions. (b) Fragment decomposition algorithm used to obtain fragments from the patent dataset. A vocabulary dictionary is first created by breaking one bond at a time and labeling the resulting fragments with unique integer values. The starting molecule is then decomposed recursively. Reactive positions are labeled with values corresponding to the broken fragment's identity. The leaf nodes (shaded in green) represent the final set of fragments obtained after decomposition. Me, Th, and Ph represent methyl, thiophene, and phenyl groups, respectively.
  • Figure 2: Features of MCTS training and DFT validation. (a) Bandgap and similarity penalty as a function of iteration. Values shown are the weighted contributions to the reward function. With every iteration, the constrained optimization becomes more challenging, resulting in less optimal candidates. (b) Representative reward evolution during training for one MCTS repetition. (c) Pruning of the tree during training. The tree is expanded to contain deeper choices such as end-groups and side-chains only if their rewards are promising, as seen in the shift towards lower predicted bandgaps deeper in the tree. (d) PCA of randomly sampled and property-optimized molecules obtained from MCTS, along with some popular experimental candidates. PCA is performed on Morgan fingerprints of the molecules, and random samples are colored by Chemprop-predicted bandgap. The random samples are chemically diverse and span a range of property values, while the MCTS-optimized candidates are diverse but concentrated in the lower-bandgap regions of the landscape. Panels (a) through (d) are plotted for the Y6 MDP. (e) Histograms showing shifts in the DFT bandgap distributions relative to the training datasets. MCTS-optimized Y6 and patent molecules exhibit a large shift towards lower bandgaps compared to the patent data distribution.
  • Figure 3: Active learning to improve reward prediction. Scatter plots showing improvement in fit with AL iterations for (a) Y6 derivatives and (b) patent-extracted fragment derivatives. The test data in (a) and (b) are the final sets of 100 molecules generated after all AL iterations have been completed with the Y6 and patent-fragment MDPs, respectively. In (a) and (b), plot i corresponds to the model pre-trained on the patent-mined dataset alone, and plot ii corresponds to the model trained on the patent dataset plus random rollouts from MCTS. Plot iii in (b) corresponds to the model trained on the patent dataset plus random rollouts plus diversity and EI acquisition samples. More details are given in the section on active learning (AL).
  • Figure 4: Final generated candidates and their TD-DFT bandgaps. K-means clustering (based on Morgan fingerprints) was performed on the final 100 molecules from each category, grouping them into 5 clusters. The molecules shown are the lowest-bandgap molecules chosen from each cluster. (a) Y6-derivatives MDP, (b) patent-extracted fragments MDP. The bandgap of the last molecule is marked with an asterisk (*) because its geometry had to be fixed before the TD-DFT calculation was performed. More details are provided in the section of the SI describing these fixes.
  • Figure 5: Computed absorption spectra for final candidate molecules. The spectra for molecules are ordered as shown in Figure 4 in the main text, with Y6 derivatives in the upper row and patent-extracted fragment designs in the lower row. Solid lines represent single-point TD-DFT calculations at the optimized geometry, while dotted lines represent statistically averaged TD-DFT spectra from molecular clusters sampled from MD simulations. The translucent vertical lines indicate the band gaps from the single-point calculations. The spectra are normalized so that the maximum absorption corresponds to 1.0.
  • ...and 3 more figures
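
The selection step described in the Figure 4 caption (cluster the final molecules, then keep the lowest-bandgap member of each cluster) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lowest_bandgap_per_cluster`, the molecule names, and the bandgap values are hypothetical, and the cluster labels are assumed to have been computed beforehand (in the paper's setting, by K-means on Morgan fingerprints).

```python
def lowest_bandgap_per_cluster(names, bandgaps_ev, cluster_labels):
    """Return {cluster_label: (name, bandgap)} for the lowest-bandgap member of each cluster.

    Hypothetical helper for illustration; cluster_labels are assumed to be
    precomputed (e.g. by K-means on molecular fingerprints).
    """
    if not (len(names) == len(bandgaps_ev) == len(cluster_labels)):
        raise ValueError("names, bandgaps_ev and cluster_labels must have equal length")

    best = {}
    for name, gap, label in zip(names, bandgaps_ev, cluster_labels):
        # Keep the molecule with the smallest bandgap seen so far in this cluster.
        if label not in best or gap < best[label][1]:
            best[label] = (name, gap)
    return best


# Toy inputs (made-up values, not results from the paper).
names = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"]
bandgaps_ev = [1.42, 1.31, 1.55, 1.28, 1.60]
cluster_labels = [0, 0, 1, 1, 2]

representatives = lowest_bandgap_per_cluster(names, bandgaps_ev, cluster_labels)
for label in sorted(representatives):
    name, gap = representatives[label]
    print(f"cluster {label}: {name} ({gap:.2f} eV)")
```

Picking one representative per cluster, rather than simply the overall lowest-bandgap molecules, keeps the displayed candidates structurally distinct from one another, which matches the caption's emphasis on diversity.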