Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing
Pranav Shetty, Aishat Adeboye, Sonakshi Gupta, Chao Zhang, Rampi Ramprasad
TL;DR
This work presents an end-to-end pipeline that harvests polymer solar cell device data from the literature via natural language processing, curates a high-quality donor/acceptor power conversion efficiency (PCE) dataset, and trains a Gaussian process (GP) regression model to predict PCE. It then simulates active-learning strategies to benchmark how quickly data-driven methods could have discovered high-performance donor/acceptor pairs, showing substantial time savings: up to about 4x faster, or ≈15 years of acceleration, when both fullerene and non-fullerene acceptors are considered. The study demonstrates that NLP-extracted data can power robust predictive models and that active-learning strategies, particularly GP with Thompson sampling (GP-TS) and GP with upper confidence bound (GP-UCB), yield strong predictive performance and efficient discovery paths. By releasing data and software, the authors provide a framework to accelerate data-driven materials discovery in domains beyond polymer solar cells.
Abstract
We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs using data extracted from the literature spanning $\sim$20 years by a natural language processing pipeline. While data-driven methods are well established as a way to discover novel materials faster than Edisonian trial-and-error approaches, their benefits have not been quantified for material discovery problems that can take decades. Our approach demonstrates a potential reduction in discovery time of approximately 75%, equivalent to a 15-year acceleration in material innovation. Our pipeline enables us to extract data from more than 3300 papers, a data set that is $\sim$5 times larger, and therefore more diverse, than similar data sets reported by others. We also trained machine learning models to predict the power conversion efficiency and used our model to identify promising donor/acceptor combinations that are as yet unreported. We thus demonstrate a pipeline that goes from published literature to extracted material property data, which in turn is used to obtain data-driven insights. Our insights include active learning strategies that either yield strong predictive models of material properties or are robust to the choice of initial material system. This work provides a valuable framework for data-driven research in materials science.
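To make the idea of simulating an active learning strategy concrete, the following is a minimal sketch of a GP-UCB loop over a fixed pool of candidates. It is not the authors' implementation: the two-dimensional features, the synthetic "efficiency" values, the RBF kernel, the exploration weight `beta`, and the function names `rbf_kernel`, `gp_posterior`, and `simulate_ucb` are all assumptions made for illustration. The sketch counts how many candidates must be measured before the best one in the pool is found.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    # Posterior mean and standard deviation of a zero-mean, unit-variance GP.
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    v = np.linalg.solve(k_tt, k_qt.T)
    var = 1.0 - np.sum(k_qt * v.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

def simulate_ucb(features, targets, n_init=5, beta=2.0, seed=0):
    # Measure candidates one at a time, always picking the unmeasured
    # candidate with the highest upper confidence bound (mean + beta * std).
    # Returns the number of measurements made when the best candidate is found.
    rng = np.random.default_rng(seed)
    measured = rng.choice(len(targets), size=n_init, replace=False).tolist()
    best = int(np.argmax(targets))
    while best not in measured:
        pool = [i for i in range(len(targets)) if i not in measured]
        y = targets[measured]
        # Center the observations because the GP prior has zero mean.
        mean, std = gp_posterior(features[measured], y - y.mean(), features[pool])
        measured.append(pool[int(np.argmax(mean + beta * std))])
    return len(measured)

# Hypothetical candidate pool: 60 points with a smooth synthetic target
# standing in for a measured device property.
data_rng = np.random.default_rng(1)
features = data_rng.uniform(-2.0, 2.0, size=(60, 2))
targets = np.exp(-np.sum((features - 1.0) ** 2, axis=1))

steps = simulate_ucb(features, targets)
print(f"Best candidate found after {steps} of {len(targets)} measurements")
```

Comparing `steps` with the number of measurements a chronological or random ordering would need gives the kind of acceleration estimate the abstract describes. Swapping the acquisition rule for a draw from the GP posterior would turn the same loop into a Thompson sampling variant.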
