GLaD: Synergizing Molecular Graphs and Language Descriptors for Enhanced Power Conversion Efficiency Prediction in Organic Photovoltaic Devices
Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, Heng Ji
TL;DR
GLaD addresses the challenge of predicting power conversion efficiency (PCE) for organic photovoltaic (OPV) devices in data-scarce regimes by integrating two molecular representations: fragment-level structural features from graph neural networks (GNNs) and textual descriptions of functional modules generated by large language models (LLMs). The approach uses a hierarchical GNN to combine module-level embeddings with donor–acceptor (D–A) pair information; the LLM-derived textual descriptors are embedded with SciBERT and fused with the graph features through attention mechanisms. The authors curate a dedicated OPV dataset (500 D–A pairs, 403 molecules, 250 modules) and report that incorporating textual descriptors improves $R^2$ on the collected dataset, HOPV, and MoleculeNet benchmarks, with strong performance on experimental PCE prediction and several classification tasks. This work supports more reliable molecular design for OPVs when data are limited and shows potential applicability to broader chemical discovery problems; future work focuses on uncertainty quantification for high-throughput screening.
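To make the attention-based fusion step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single graph embedding acts as the query over a small set of text-descriptor embeddings, uses plain scaled dot-product attention without the learned projections a real model would have, and concatenates the result with the graph embedding. The function name `attention_fuse` and all vector values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_fuse(graph_vec, text_vecs):
    """Fuse one graph embedding with several text-descriptor embeddings.

    Assumption: scaled dot-product attention with the graph embedding as
    the query and the text embeddings as both keys and values. Learned
    projection matrices are omitted for brevity.
    """
    scale = math.sqrt(len(graph_vec))
    scores = [dot(graph_vec, t) / scale for t in text_vecs]
    weights = softmax(scores)
    # Weighted sum of text embeddings (the attended text context).
    context = [
        sum(w * t[i] for w, t in zip(weights, text_vecs))
        for i in range(len(graph_vec))
    ]
    # Concatenate structural and textual views into one representation.
    return graph_vec + context, weights


# Toy 4-dimensional embeddings (illustrative values only).
graph_vec = [1.0, 0.0, 0.5, -0.5]
text_vecs = [
    [0.9, 0.1, 0.4, -0.6],   # descriptor similar to the graph embedding
    [-1.0, 0.2, 0.0, 0.3],   # dissimilar descriptor
    [0.0, 1.0, 0.0, 0.0],    # orthogonal descriptor
]
fused, weights = attention_fuse(graph_vec, text_vecs)
```

In this toy setup the descriptor most aligned with the graph embedding receives the largest attention weight, so it dominates the textual half of the fused vector.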
Abstract
This paper presents a novel approach for predicting the Power Conversion Efficiency (PCE) of Organic Photovoltaic (OPV) devices, called GLaD: synergizing molecular Graphs and Language Descriptors for enhanced PCE prediction. Because high-quality experimental data are scarce, we collect a dataset of 500 pairs of OPV donor and acceptor molecules with their corresponding PCE values and use it as training data for our predictive model. In this low-data regime, GLaD leverages properties learned by large language models (LLMs) pretrained on extensive scientific literature to enrich molecular structural representations, yielding a multimodal representation of molecules. GLaD achieves precise predictions of PCE, thereby facilitating the synthesis of new OPV molecules with improved efficiency. Furthermore, GLaD is versatile: it applies to a range of molecular property prediction tasks (BBBP, BACE, ClinTox, and SIDER), not only those concerning OPV materials. In particular, GLaD proves valuable for low-data tasks in the chemical space, as it enriches molecular representations with molecular property descriptions learned from large-scale pretraining. This capability is significant in real-world scientific endeavors such as drug and material discovery, where access to comprehensive data is crucial for informed decision-making and efficient exploration of the chemical space.
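Prediction quality for a regression target such as PCE is commonly summarized by the coefficient of determination, $R^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$. The sketch below shows that computation in plain Python; the PCE values are made up for illustration and are not results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted PCE values (in percent).
pce_measured = [10.0, 12.0, 14.0, 16.0]
pce_predicted = [11.0, 12.0, 14.0, 15.0]
score = r_squared(pce_measured, pce_predicted)
```

A score of 1 means perfect prediction, while a score of 0 means the model does no better than always predicting the mean PCE of the evaluation set.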
