Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing

Pranav Shetty; Aishat Adeboye; Sonakshi Gupta; Chao Zhang; Rampi Ramprasad

TL;DR

This work presents an end-to-end pipeline that harvests polymer solar cell device data from the literature via natural language processing (NLP), curates a donor/acceptor power conversion efficiency (PCE) data set, and trains a Gaussian process (GP) regression model to predict PCE. It then simulates active-learning strategies to benchmark how quickly data-driven methods could have discovered high-performance donor/acceptor pairs. The simulations show substantial time savings: discovery is up to about 4x faster, or ≈15 years of acceleration, when both fullerene and non-fullerene acceptors are considered. The study demonstrates that NLP-extracted data can power robust predictive models and that active-learning strategies, particularly GP-Thompson sampling and GP-UCB, yield strong predictive performance and efficient discovery paths. By releasing data and software, the authors provide a framework to accelerate data-driven materials discovery in domains beyond polymer solar cells.
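The simulated discovery loop summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the two-dimensional synthetic features, the RBF kernel with fixed hyperparameters, the `beta` exploration weight, and the function names `rbf_kernel`, `gp_posterior`, and `simulate_discovery` are all hypothetical. It shows one way a GP-UCB rule could pick the next donor/acceptor candidate from a fixed pool, one at a time.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    """Posterior mean and std of a zero-mean GP with unit prior variance."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # Prior variance is 1 for this kernel; subtract the explained part.
    var = 1.0 - np.einsum("ij,ji->i", k_qt, np.linalg.solve(k_tt, k_qt.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))


def simulate_discovery(features, pce, start_index, n_steps, beta=2.0):
    """Return the order in which candidates are 'discovered' under GP-UCB."""
    measured = [start_index]
    for _ in range(n_steps):
        remaining = [i for i in range(len(pce)) if i not in measured]
        if not remaining:
            break
        # Center the targets because the GP above has a zero prior mean.
        offset = pce[measured].mean()
        mean, std = gp_posterior(
            features[measured], pce[measured] - offset, features[remaining]
        )
        ucb = mean + offset + beta * std  # upper confidence bound
        measured.append(remaining[int(np.argmax(ucb))])
    return measured


# Synthetic stand-in for a candidate pool (NOT data from the paper).
rng = np.random.default_rng(0)
features = rng.uniform(-2.0, 2.0, size=(40, 2))
pce = 10.0 * np.exp(-((features - 1.0) ** 2).sum(axis=1))

# Start from the worst candidate and track the best PCE found so far.
order = simulate_discovery(features, pce, start_index=int(np.argmin(pce)), n_steps=15)
best_so_far = np.maximum.accumulate(pce[order])
```

Plotting `best_so_far` against the step index gives a simulated discovery curve that could be compared with the historical record. Replacing the `ucb` score with a random draw from the GP posterior would turn the same loop into a Thompson-sampling variant.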

Abstract

We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs, using data extracted by a natural language processing pipeline from literature spanning $\sim$20 years. While data-driven methods are well established as a way to discover novel materials faster than Edisonian trial-and-error approaches, their benefits have not been quantified for material discovery problems that can take decades. Our approach demonstrates a potential reduction in discovery time of approximately 75%, equivalent to a 15-year acceleration in material innovation. Our pipeline enables us to extract data from more than 3300 papers, a corpus $\sim$5 times larger, and therefore more diverse, than similar data sets reported by others. We also trained machine learning models to predict the power conversion efficiency and used our model to identify promising donor/acceptor combinations that are as yet unreported. We thus demonstrate a pipeline that goes from published literature to extracted material property data, which in turn is used to obtain data-driven insights. Our insights include active learning strategies that either train strong predictive models of material properties or are robust to the choice of initial material system. This work provides a valuable framework for data-driven research in materials science.
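The headline figures can be reconciled with a short worked calculation. This is only a sketch under one assumption that the text does not state explicitly: that the baseline discovery time $T$ is the $\sim$20-year span of the literature. The symbols $T$, $r$, and $T_{\mathrm{sim}}$ are introduced here for illustration. With a fractional reduction $r \approx 0.75$:

$$
\begin{aligned}
\text{time saved} &= r\,T \approx 0.75 \times 20\ \text{years} = 15\ \text{years},\\
T_{\mathrm{sim}} &= (1 - r)\,T \approx 0.25 \times 20\ \text{years} = 5\ \text{years},\\
\text{speed-up} &= \frac{T}{T_{\mathrm{sim}}} = \frac{1}{1 - r} \approx 4.
\end{aligned}
$$

Under that assumption, a 75% reduction, a 15-year acceleration, and a roughly 4x faster discovery are three statements of the same result.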

Paper Structure (16 sections, 2 equations, 7 figures, 5 tables)

Introduction
Methods
Data extraction pipeline from literature
Creating PolymerSolarCells$_{NLP}$
Curating PolymerSolarCells$_{NLP}$ to create PolymerSolarCells$_{Curated}$
Machine learning prediction of power conversion efficiency
Data selection methods for simulating active learning of polymer solar cells
Results and discussion
Analysis of polymer solar cell data
Predicting power conversion efficiency
Simulating the 'discovery' of new donor/acceptor combinations
Analyzing the predictions from data selection methods
Summary and Outlook
Supporting Information
Acknowledgements
...and 1 more sections

Figures (7)

Figure 1: Pipeline used for extracting polymer solar cell PCE data from published literature. The data is then used in two ways: 1) to train ML models of PCE and predict high-performing donor/acceptor pairs not reported in the literature, and 2) to simulate an active learning loop through which donor/acceptor pairs are 'discovered' sequentially. (J71, Y6), referenced in the figure, is a donor/acceptor pair.
Figure 2: Entire donor/acceptor space at a glance. The donors and acceptors that are most frequently reported are shown along the axes. The top three most commonly reported donors and acceptors are spaced uniformly along each corresponding axis so that they can be clearly distinguished. The remaining donors and acceptors are randomly ordered.
Figure 3: Parity plot for a machine learning model trained to predict power conversion efficiency. a) Model trained using only donors as input. b) Model trained using donors and acceptors as input.
Figure 4: Predicted power conversion efficiency value for the entire donor/acceptor space. The donors and acceptors with the highest average PCE are shown along the axes. The ordering of donors and acceptors is the same as in Figure 2.
Figure 5: Comparing the simulated path of material systems generated by data selection methods against the evolution of power conversion efficiency in the experimental literature. a) Both fullerene and non-fullerene acceptors included among candidate material systems. b) Fullerene acceptors only. c) Non-fullerene acceptors only. Panels d-f show the range of values obtained over ten different starting material systems for each data selection method tested, for the acceptors used in the panel directly above. In the box-and-whisker plots in panels d-f, the red line indicates the median of the data and the cross represents the mean. Hollow circles are outliers.
...and 2 more figures
