CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic Discovery

Hou Hei Lam, Jiangjie Qiu, Xiuyuan Hu, Wentao Li, Fankun Zeng, Siwei Fu, Hao Zhang, Xiaonan Wang

TL;DR

CycleChemist addresses the challenge of jointly designing donor and acceptor materials for OPVs by integrating a predictive core (OPVC, MOE2, P3) with a generative agent (MatGPT) trained via reinforcement learning. It introduces OPV2D, a curated dataset of 2,000 donor–acceptor pairs, and demonstrates high predictive accuracy for OPV activity ($P(\text{OPV}|x)$), frontier orbital energies ($\hat{y}_{\text{HOMO}}$, $\hat{y}_{\text{LUMO}}$), and PCE ($\hat{y}_{\text{PCE}}$). The generative component achieves high validity and novelty on MOSES benchmarks, and RL-guided design yields donor–acceptor pairs with spectral complementarity validated by DFT/TD-DFT. The results showcase a scalable, data-driven pipeline for rapid OPV screening and materials discovery with practical implications for sustainable energy.

Abstract

Organic photovoltaic (OPV) materials offer a promising path toward sustainable energy generation, but their development is limited by the difficulty of identifying high-performance donor and acceptor pairs with strong power conversion efficiencies (PCEs). Existing design strategies typically focus on either the donor or the acceptor alone, rather than using a unified approach capable of modeling both components. In this work, we introduce a dual machine learning framework for OPV discovery that combines predictive modeling with generative molecular design. We present the Organic Photovoltaic Donor Acceptor Dataset (OPV2D), the largest curated dataset of its kind, containing 2,000 experimentally characterized donor-acceptor pairs. Using this dataset, we develop the Organic Photovoltaic Classifier (OPVC) to predict whether a material exhibits OPV behavior, and a hierarchical graph neural network that incorporates multi-task learning and donor-acceptor interaction modeling. This framework includes the Molecular Orbital Energy Estimator (MOE2) for predicting HOMO and LUMO energy levels, and the Photovoltaic Performance Predictor (P3) for estimating PCE. In addition, we introduce the Material Generative Pretrained Transformer (MatGPT) to produce synthetically accessible organic semiconductors, guided by a reinforcement learning strategy with three-objective policy optimization. By linking molecular representation learning with performance prediction, our framework advances data-driven discovery of high-performance OPV materials.

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The dual-pronged OPV materials discovery framework of our work integrates material generation, property prediction, and reinforcement learning for iterative enhancement. Molecular Orbital Energy Estimator (MOE2) is pre-trained via masked language modeling (MLM) and fine-tuned for HOMO-LUMO prediction. Photovoltaic Performance Predictor (P3) predicts power conversion efficiency (PCE) through supervised regression. Organic Photovoltaic Classifier (OPVC) is a Random Forest model that predicts the probability of a material being an organic photovoltaic (OPV). Material Generative Pretrained Transformer (MatGPT) is pre-trained using causal language modeling (CLM) and fine-tuned with reinforcement learning (RL) for high-PCE OPV molecular generation.
  • Figure 2: Dataset overview: photovoltaic parameter distributions and molecular structural diversity revealed by t-SNE visualization. The data spans a wide range of property values and structural variations.
  • Figure 3: The architecture of (a) MOE2 and (b) P3 integrates hierarchical graph neural networks, multi-task learning, and cross-attention mechanisms to jointly predict molecular electronic properties (HOMO-LUMO) and photovoltaic performance (PCE), enabling efficient donor-acceptor screening for organic photovoltaics.
  • Figure 4: MOSES benchmark performance comparison. Arrows indicate optimal direction ($\uparrow$=higher better, $\downarrow$=lower better). Best values in green, worst in red. Values show mean ± std. dev. All models were trained on the 1M PubChem dataset, except for MatGPT-Large, which was trained on 5M. N-gram generated only around 3,200 valid molecules (insufficient for Unique@10k).
  • Figure 5: Mean top-10 reward in the memory during RL training.
  • ...and 1 more figure
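
The quantity tracked in Figure 5, the mean top-10 reward held in memory during RL training, can be illustrated with a small sketch. This is not the authors' implementation: the class `TopKMemory`, its methods `add` and `mean_top`, the SMILES-string keys, and the choice to ignore repeated molecules are all assumptions made here for illustration, since this section does not describe how the memory is maintained.

```python
import heapq
from statistics import mean


class TopKMemory:
    """Hypothetical buffer keeping the k highest-reward molecules seen so far."""

    def __init__(self, k=10):
        self.k = k
        self._heap = []      # min-heap of (reward, smiles); root is the weakest entry
        self._seen = set()   # molecules currently stored, to skip exact repeats

    def add(self, smiles, reward):
        # Assumption: a molecule already in memory is not stored twice.
        if smiles in self._seen:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (reward, smiles))
            self._seen.add(smiles)
        elif reward > self._heap[0][0]:
            # Replace the current weakest entry with the better newcomer.
            _, dropped = heapq.heapreplace(self._heap, (reward, smiles))
            self._seen.discard(dropped)
            self._seen.add(smiles)

    def mean_top(self):
        # Mean reward over the stored entries (at most k of them).
        return mean(r for r, _ in self._heap) if self._heap else 0.0


# Usage: log mean_top() after each batch to obtain a curve like the one in Figure 5.
memory = TopKMemory(k=3)
for smiles, reward in [("C", 0.1), ("CC", 0.5), ("CCC", 0.3), ("CCCC", 0.9), ("CCO", 0.2)]:
    memory.add(smiles, reward)
print(memory.mean_top())
```

A min-heap is used so that the weakest stored entry can be found and replaced cheaply each time a better molecule arrives, which keeps the per-step cost small even for long training runs.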