GLaD: Synergizing Molecular Graphs and Language Descriptors for Enhanced Power Conversion Efficiency Prediction in Organic Photovoltaic Devices

Thao Nguyen; Tiara Torres-Flores; Changhyun Hwang; Carl Edwards; Ying Diao; Heng Ji

TL;DR

GLaD addresses the challenge of predicting power conversion efficiency (PCE) for organic photovoltaic (OPV) devices in data-scarce regimes by integrating multimodal molecular representations: fragment-level structural features from graph neural networks (GNNs) and textual descriptions of functional modules generated by large language models (LLMs). The approach uses a hierarchical GNN to combine module-level embeddings with donor–acceptor (D–A) pair information, augmented by LLM-derived textual descriptors that are embedded with SciBERT and fused through attention mechanisms. The authors curate a dedicated OPV dataset (500 D–A pairs, 403 molecules, 250 modules) and demonstrate that incorporating textual descriptors improves $R^2$ on the collected dataset, HOPV, and MoleculeNet benchmarks, including strong performance on experimental PCE prediction and several classification tasks. This work enables more reliable low-data molecular design in OPVs and shows potential applicability to broader chemical discovery problems, with future work focusing on uncertainty quantification for high-throughput screening.
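To make the fusion step concrete, here is a minimal sketch of attention-based fusion between one structural embedding and a set of text-descriptor embeddings. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify the exact fusion layer, so the function name `attention_fuse`, the use of scaled dot-product attention, the concatenation at the end, and the toy 4-dimensional vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(graph_vec, text_vecs):
    """Hypothetical fusion: the graph embedding acts as the query and the
    text-descriptor embeddings act as keys and values (assumed design)."""
    scale = math.sqrt(len(graph_vec))
    weights = softmax([dot(graph_vec, t) / scale for t in text_vecs])
    # Weighted sum of text embeddings, one component at a time.
    context = [
        sum(w * t[i] for w, t in zip(weights, text_vecs))
        for i in range(len(graph_vec))
    ]
    # Concatenate the structural vector with the attended text context.
    return graph_vec + context, weights


# Toy inputs standing in for a GNN module embedding and two SciBERT
# descriptor embeddings (real embeddings would be much larger).
graph_vec = [1.0, 0.0, 0.0, 0.0]
text_vecs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused, weights = attention_fuse(graph_vec, text_vecs)
```

In this toy case the first descriptor is aligned with the structural vector, so it receives the larger attention weight, and the fused vector has twice the input dimension.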

Abstract

This paper presents a novel approach for predicting the Power Conversion Efficiency (PCE) of Organic Photovoltaic (OPV) devices, called GLaD: synergizing molecular Graphs and Language Descriptors for enhanced PCE prediction. Because high-quality experimental data are scarce, we collect a dataset of 500 pairs of OPV donor and acceptor molecules along with their corresponding PCE values, which we use as training data for our predictive model. In this low-data regime, GLaD leverages properties learned by large language models (LLMs) pretrained on extensive scientific literature to enrich molecular structural representations, yielding a multimodal representation of molecules. GLaD achieves precise predictions of PCE, thereby facilitating the synthesis of new OPV molecules with improved efficiency. GLaD is also versatile: it applies to a range of molecular property prediction tasks (BBBP, BACE, ClinTox, and SIDER), not only those concerning OPV materials. In particular, GLaD proves valuable for tasks in low-data regimes within the chemical space, as it enriches molecular representations by incorporating molecular property descriptions learned from large-scale pretraining. This capability is significant in real-world scientific endeavors such as drug and material discovery, where access to comprehensive data is crucial for informed decision-making and efficient exploration of the chemical space.

Paper Structure (18 sections, 2 equations, 7 figures, 9 tables)


Introduction
Background
Related Work
PCE Prediction
Multimodal Representation of Molecules: Graph Structure and Textual Descriptions
OPV Dataset Collection
Fusing Text with Molecular Structure
Modeling Molecular Structure
Generating Textual Descriptions for Functional Modules
Modeling Textual Descriptions
Fusion Approaches
Experiments
Experimental Settings
Main Results
Conclusion & Future Work
...and 3 more sections

Figures (7)

Figure 1: Overview of the GLaD PCE prediction framework.
Figure 2: Distribution of pairwise Tanimoto distances among molecules in the collected dataset.
Figure 3: Distribution of pairwise Tanimoto distances among molecules in the HOPV dataset.
Figure 4: Examples of functional module descriptions generated by ChatGPT 3.5.
Figure 5: UMAP representation of text descriptions for ring-containing and non-ring-containing fragments in the text embedding space.
...and 2 more figures
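
Figures 2 and 3 report distributions of pairwise Tanimoto distances. As a rough sketch of how such distances can be computed, the snippet below treats each molecular fingerprint as a set of "on" bit indices and uses distance = 1 - |A ∩ B| / |A ∪ B|. The fingerprint type used in the paper is not stated here, so the bit sets and the molecule names `mol_a`, `mol_b`, and `mol_c` are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets
    of on-bit indices. Two empty fingerprints are treated as identical."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(union)


def pairwise_distances(fingerprints):
    """Return {(name_i, name_j): distance} for every unordered pair."""
    return {
        (a, b): tanimoto_distance(fingerprints[a], fingerprints[b])
        for a, b in combinations(sorted(fingerprints), 2)
    }


# Hypothetical fingerprints; a cheminformatics toolkit would normally
# produce these from molecular structures.
fingerprints = {
    "mol_a": {1, 2, 3},
    "mol_b": {2, 3, 4},
    "mol_c": {7, 8},
}
distances = pairwise_distances(fingerprints)
```

A histogram of the values in `distances` gives the kind of distribution the two figures describe; values near 1 indicate structurally dissimilar molecules.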
