POINT$^{2}$: A Polymer Informatics Training and Testing Database
Jiaxin Xu, Gang Liu, Ruilan Guo, Meng Jiang, Tengfei Luo
TL;DR
POINT$^{2}$ establishes a standardized benchmark and workflow for polymer informatics that jointly targets prediction accuracy, uncertainty quantification (UQ), interpretability, and synthesizability. The framework pairs multiple ML models, including quantile random forests (QRF), multilayer perceptrons with dropout (MLP-D), graph neural networks (GNNs), and GREA, with a diverse set of polymer representations. The models are evaluated on a benchmark dataset spanning glass transition temperature (Tg), melting temperature (Tm), thermal conductivity (TC), fractional free volume (FFV), density (ρ), and gas permeabilities, and are complemented by a polymer retrosynthesis tool and the PolyScore synthesizability metric. Key contributions include a large-scale benchmark built from labeled and unlabeled polymer data, a detailed comparison of prediction and UQ performance across representations and models, interpretable insights via SHAP and rationale analysis, and practical screening case studies on the PI1M virtual polymer space. The open-source POINT$^{2}$ database, together with the retrosynthesis templates and PolyScore, provides a resource for polymer discovery and optimization that supports transparent model evaluation and synthesis-aware design. By incorporating uncertainty and synthesizability into property prediction, the work shows how predictive performance can be combined with retrosynthetic analysis to guide practical material design. The results are relevant to accelerated polymer discovery and rational design in applications ranging from energy to gas separations, and they point to areas for future improvement, such as expanding the set of polymerization templates and improving calibration across property spaces.
Abstract
The advancement of polymer informatics has been significantly propelled by the integration of machine learning (ML) techniques, which enable rapid prediction of polymer properties and expedite the discovery of high-performance polymeric materials. However, the field lacks a standardized workflow that encompasses prediction accuracy, uncertainty quantification, ML interpretability, and polymer synthesizability. In this study, we introduce POINT$^{2}$ (POlymer INformatics Training and Testing), a comprehensive benchmark database and protocol designed to address these critical challenges. Leveraging existing labeled datasets and the unlabeled PI1M dataset, a collection of approximately one million virtual polymers generated by a recurrent neural network trained on real polymers, we develop an ensemble of ML models, including Quantile Random Forests, Multilayer Perceptrons with dropout, Graph Neural Networks, and pretrained large language models. These models are coupled with diverse polymer representations, such as Morgan, MACCS, RDKit, Topological, and Atom Pair fingerprints as well as graph-based descriptors. Together they provide property predictions, uncertainty estimates, model interpretability, and template-based assessment of polymerization synthesizability across a spectrum of properties, including gas permeability, thermal conductivity, glass transition temperature, melting temperature, fractional free volume, and density. The POINT$^{2}$ database can serve as a valuable resource for the polymer informatics community for polymer discovery and optimization.
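To make the uncertainty-estimation goal above concrete, the following minimal sketch shows one common way to check prediction intervals: the fraction of measured values that fall inside the intervals (empirical coverage) and the average interval width. This is an illustration only. The function names, the choice of these two metrics, and the Tg values are assumptions made for the example; they are not taken from the POINT$^{2}$ protocol or data.

```python
from typing import Sequence


def interval_coverage(
    y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of true values that lie inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def mean_interval_width(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Average width of the prediction intervals (narrower is sharper)."""
    if len(lower) != len(upper) or len(lower) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical glass transition temperatures (K) and interval bounds,
# e.g. lower/upper quantiles from a quantile-based model. Made-up numbers.
tg_true = [373.0, 420.0, 350.0, 500.0]
tg_lower = [360.0, 425.0, 340.0, 480.0]
tg_upper = [380.0, 445.0, 365.0, 510.0]

coverage = interval_coverage(tg_true, tg_lower, tg_upper)
width = mean_interval_width(tg_lower, tg_upper)
print(f"coverage={coverage:.2f}, mean width={width:.2f} K")
```

A well-calibrated model with nominal 90% intervals would be expected to show coverage near 0.90 on held-out data; reporting width alongside coverage guards against trivially wide intervals.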
