Generating $π$-Functional Molecules Using STGG+ with Active Learning

Alexia Jolicoeur-Martineau, Yan Zhang, Boris Knyazev, Aristide Baratin, Cheng-Hao Liu

TL;DR

This work tackles the challenge of designing $π$-conjugated molecules with out-of-distribution optoelectronic properties by integrating STGG+ with an active-learning loop (STGG+AL). Starting from a 2.9 million-molecule Conjugated-xTB dataset, STGG+AL iteratively generates, evaluates via sTDA-xTB, and fine-tunes on newly labeled data to push toward higher oscillator strength $f_\text{osc}$ and targeted absorption wavelengths, with TD-DFT used for validation. The approach achieves substantially higher $f_\text{osc}$ values (up to about 27.7 for general maximization and up to 2.44 under NIR constraints) than strong baselines like GraphGA and REINVENT4, while producing chemically sound scaffolds. The open-source code and dataset enable broader adoption and adaptation to other out-of-distribution molecular design objectives, highlighting a practical, sample-efficient path for discovering high-performance optoelectronic materials.

Abstract

Generating novel molecules with out-of-distribution properties is a major challenge in molecular discovery. While supervised learning methods generate high-quality molecules similar to those in a dataset, they struggle to generalize to out-of-distribution properties. Reinforcement learning can explore new chemical spaces but is prone to reward hacking and often generates non-synthesizable molecules. In this work, we address this problem by integrating a state-of-the-art supervised learning method, STGG+, into an active learning loop. Our approach iteratively generates molecules, evaluates them, and fine-tunes STGG+ on the results to continuously expand its knowledge. We denote this approach STGG+AL. We apply STGG+AL to the design of organic $π$-functional materials, specifically to two challenging tasks: 1) generating highly absorptive molecules characterized by high oscillator strength and 2) designing absorptive molecules with reasonable oscillator strength in the near-infrared (NIR) range. The generated molecules are validated and rationalized in silico with time-dependent density functional theory. Our results demonstrate that our method is highly effective in generating novel molecules with high oscillator strength, unlike existing approaches such as reinforcement learning (RL) methods. We open-source our active-learning code along with our Conjugated-xTB dataset containing 2.9 million $π$-conjugated molecules and the function for approximating the oscillator strength and absorption wavelength (based on sTDA-xTB).
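The generate, evaluate, fine-tune cycle described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `ToyGenerator` is a Gaussian sampler standing in for a generative model (it is not STGG+), `toy_oracle` is a placeholder property function (it is not sTDA-xTB), candidates are plain floats rather than molecules, and keeping only the `top_k` labeled candidates before refitting is one possible selection rule, not necessarily the one used in the paper.

```python
import random
import statistics


class ToyGenerator:
    """Hypothetical stand-in for a generative model (NOT STGG+)."""

    def __init__(self, data):
        self.fit(data)

    def fit(self, data):
        # "Training": summarize the candidates seen so far by mean and spread.
        xs = [x for x, _ in data]
        self.mean = statistics.fmean(xs)
        self.std = statistics.pstdev(xs) or 1.0  # avoid a degenerate sampler

    def sample(self, n, rng):
        return [rng.gauss(self.mean, self.std) for _ in range(n)]


def toy_oracle(x):
    """Hypothetical property evaluator (NOT sTDA-xTB): higher is better."""
    return x


def active_learning(data, rounds, n_samples, top_k, seed=0):
    """Run a toy generate -> evaluate -> fine-tune loop.

    Returns the best property value in the labeled pool after each round.
    """
    rng = random.Random(seed)
    model = ToyGenerator(data)
    best_per_round = []
    for _ in range(rounds):
        candidates = model.sample(n_samples, rng)            # 1) generate
        labeled = [(x, toy_oracle(x)) for x in candidates]   # 2) evaluate
        # Assumed selection rule: keep the top_k labeled candidates overall.
        pool = sorted(data + labeled, key=lambda p: p[1], reverse=True)
        data = pool[:top_k]
        model.fit(data)                                      # 3) fine-tune
        best_per_round.append(data[0][1])
    return best_per_round
```

Calling `active_learning(seed_data, rounds=5, n_samples=50, top_k=20)` on a small list of `(candidate, property)` pairs returns one best-so-far value per round. Because the pool always retains its best entries, that value can never decrease from one round to the next, which is the sense in which the loop pushes the model beyond its starting distribution.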

Paper Structure

This paper contains 17 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Maximizing $f_\text{osc}$. STGG+ with active learning generates strong out-of-distribution (OOD) molecules.
  • Figure 2: Maximizing $f_\text{osc}$ in the short-wave infrared range. STGG+ with active learning generates strong OOD molecules.
  • Figure 3: Case study of the top-1 molecule with the highest $f_\text{osc}$.
  • Figure 4: Maximizing $f_\text{osc}$ using active learning with constraints: max 70 heavy atoms, max ring-size of 6. STGG+ (top-1, top-10, top-100; from a single run) vs GraphGA (top-1; average and 95% confidence interval over 3 runs).
  • Figure 5: Case study of the top-1 molecule with NIR absorption and the highest $f_\text{osc}$.
  • ...and 8 more figures