POINT$^{2}$: A Polymer Informatics Training and Testing Database

Jiaxin Xu, Gang Liu, Ruilan Guo, Meng Jiang, Tengfei Luo

TL;DR

POINT$^{2}$ establishes a standardized benchmark and workflow for polymer informatics that simultaneously targets prediction accuracy, uncertainty quantification, interpretability, and synthesizability. The framework integrates multiple ML models (including QRF, MLP-D, GNNs, and GREA) with a diverse set of polymer representations, evaluated on a benchmark dataset spanning Tg, Tm, TC, FFV, ρ, and gas permeabilities, and complemented by a polymer retrosynthesis tool and the PolyScore synthesizability metric. Key contributions include a large-scale benchmark built from labeled and unlabeled polymer data, a detailed comparison of prediction and UQ performance across representations and models, interpretable insights via SHAP and rationale analysis, and practical screening case studies on the PI1M virtual polymer space. The open-source POINT$^2$ database, together with retrosynthesis templates and PolyScore, provides a robust resource for polymer discovery and optimization, enabling transparent model evaluation and synthesis-aware design. The work advances polymer informatics by offering a unified benchmark that incorporates uncertainty and synthesizability into property prediction, and by demonstrating how combining predictive performance with retrosynthetic analysis can guide practical material design. This has immediate implications for accelerated polymer discovery and rational design in industries ranging from energy to gas separations, while also highlighting areas for future improvement, such as expanding polymerization templates and improving calibration across property spaces.

Abstract

The advancement of polymer informatics has been significantly propelled by the integration of machine learning (ML) techniques, enabling the rapid prediction of polymer properties and expediting the discovery of high-performance polymeric materials. However, the field lacks a standardized workflow that encompasses prediction accuracy, uncertainty quantification, ML interpretability, and polymer synthesizability. In this study, we introduce POINT$^{2}$ (POlymer INformatics Training and Testing), a comprehensive benchmark database and protocol designed to address these critical challenges. Leveraging existing labeled datasets and the unlabeled PI1M dataset, a collection of approximately one million virtual polymers generated by a recurrent neural network trained on real polymers, we develop an ensemble of ML models, including Quantile Random Forests, Multilayer Perceptrons with dropout, Graph Neural Networks, and pretrained large language models. These models are coupled with diverse polymer representations such as Morgan, MACCS, RDKit, Topological, and Atom Pair fingerprints, as well as graph-based descriptors, to provide property predictions, uncertainty estimates, model interpretability, and template-based polymerization synthesizability across a spectrum of properties, including gas permeability, thermal conductivity, glass transition temperature, melting temperature, fractional free volume, and density. The POINT$^{2}$ database can serve as a valuable resource for the polymer informatics community for polymer discovery and optimization.

Paper Structure

This paper contains 20 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Schematic representation of the integrated workflow for polymer screening, highlighting four key components: Accurate Prediction, Uncertainty Quantification, Model Interpretability, and Polymer Synthesizability.
  • Figure 2: An example of model prediction interpretation on the Tg test dataset. (a) Beeswarm plot of SHAP values on the test dataset using the QRF model and Morgan fingerprints. The x-axis represents the SHAP values, which quantify the impact of each fingerprint bit on the model's prediction: positive values increase Tg, while negative values decrease it. The y-axis lists the top-20 most important fingerprint bits, ranked in descending order by their average absolute SHAP value (i.e., the most influential bits are at the top). The color of the dots corresponds to the feature value: red indicates bit=1 in Morgan fingerprints, while blue represents bit=0. (b) Molecular visualization of important bits in the Morgan fingerprint. "A" is a wildcard atom that represents any atom type, and "*" represents the polymerization point in the repeat unit of the polymer. (c) Molecular structure and rationale interpretation (highlighted in green) of two polymer explicands from the GREA model. (d) Waterfall plot of SHAP values of the same two polymer explicands as in panel (c), using the QRF model and Morgan fingerprints. The x-axis shows the cumulative SHAP value contributions leading to the final prediction $f(x)$, with the base value (expected model output) at the far left and the final model prediction at the far right. Each bar represents the contribution of a single fingerprint bit. Bars are annotated with the bit ID and the magnitude of their contribution.
  • Figure 3: Examples of retrosynthesis planning of polymers from: (a) condensation, (b) addition, and (c) ring-opening polymerization. The condensation reactions involve monomers with reactive groups that release small molecules upon polymerization. Addition polymerizations involve linking monomers via reactive double bonds. Ring-opening polymerizations involve the cleavage of ring structures to form linear chains.
  • Figure 4: Results of Case Study 1: designing high-performance polymers for thermal management. The top row shows radar plots with design constraints (shaded gray regions) for key properties: density ($0.8{-}1.2 \, \mathrm{g/cm^3}$), FFV ($0.3{-}0.35$), TC ($>0.35\, \mathrm{W/mK}$), Tg ($> 250^\circ\mathrm{C}$), and Tm ($>350^\circ\mathrm{C}$). The red line and numbers indicate the predicted mean property values, while the shaded red area represents the uncertainty range. Models used for predictions are: FFV (MLP-D+AP), TC (MLP-D+TT), Tg (MLP-D+Morgan), Tm (MLP-D+RDKit), and density (MLP-D+Morgan). The middle row displays the molecular structure and SMILES of the candidate polymers. The bottom row shows retrosynthetic pathways, where each box represents a potential route. Boxes with solid borders indicate that all monomers are available in PubChem (CID provided), while dashed borders denote that at least one monomer is unknown (UNK). The synthesizability of monomers is quantified by SAScore. The PolyScore of each proposed route is shown at the bottom. Candidate 3 is the final selection due to its optimal properties and the most feasible synthesis route, as highlighted with a green box.
  • Figure 5: Results of Case Study 2: designing high-performance polymers for gas separation membranes. The top row shows radar plots with design constraints (shaded gray regions) for key properties: density ($>1.4 \, \mathrm{g/cm^3}$), FFV ($>0.35$), Tg ($>180^\circ\mathrm{C}$), log$_{10}$(PCH$_4$ in Barrer) ($<-0.8$), and log$_{10}$(PCO$_2$ in Barrer) ($>1.5$). The red line and numbers indicate the predicted mean property values, while the shaded red area represents the uncertainty range. Models used for predictions are: FFV (MLP-D+AP), Tg (MLP-D+Morgan), density (MLP-D+Morgan), P(CH$_4$) (QRF+TT), and P(CO$_2$) (MLP-D+AP). The middle row displays the molecular structure and SMILES of the candidate polymers. The bottom row shows retrosynthetic pathways, where each box represents a potential route. No route is identified using current templates for Candidate 2. Boxes with solid borders indicate that all monomers are available in PubChem (CID provided), while dashed borders denote that at least one monomer is unknown (UNK). The synthesizability of monomers is quantified by SAScore. The PolyScore of each proposed route is shown at the bottom. Candidate 3 is the final selection due to its optimal properties and the most feasible synthesis route, as highlighted with a green box.
  • ...and 4 more figures
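
The design constraints listed in the Figure 4 caption can be read as a simple screening filter over predicted properties. The sketch below is only an illustration of that idea, not the authors' implementation: the `Prediction` class, the `satisfies` and `screen` functions, the representation of uncertainty as a symmetric mean ± k·std interval, and the inclusive treatment of bounds are all assumptions made here. Only the numeric property windows come from the caption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]  # (lower, upper); None = unbounded

# Property windows from the Figure 4 caption (thermal management case study).
THERMAL_CONSTRAINTS: Dict[str, Bounds] = {
    "density": (0.8, 1.2),   # g/cm^3
    "FFV": (0.3, 0.35),      # dimensionless
    "TC": (0.35, None),      # W/mK
    "Tg": (250.0, None),     # degrees C
    "Tm": (350.0, None),     # degrees C
}


@dataclass
class Prediction:
    """Hypothetical container: predicted mean and a symmetric uncertainty."""
    mean: float
    std: float


def satisfies(pred: Prediction, bounds: Bounds, k: float = 0.0) -> bool:
    """True if the interval mean +/- k*std lies inside the bounds (inclusive)."""
    lower, upper = bounds
    low, high = pred.mean - k * pred.std, pred.mean + k * pred.std
    if lower is not None and low < lower:
        return False
    if upper is not None and high > upper:
        return False
    return True


def screen(candidates: Dict[str, Dict[str, Prediction]],
           constraints: Dict[str, Bounds],
           k: float = 0.0) -> List[str]:
    """Return names of candidates that meet every constraint.

    A candidate missing a prediction for a constrained property is rejected.
    """
    passed = []
    for name, preds in candidates.items():
        if all(prop in preds and satisfies(preds[prop], bounds, k)
               for prop, bounds in constraints.items()):
            passed.append(name)
    return passed


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not results from the paper.
    toy = {
        "cand_a": {"density": Prediction(1.0, 0.05), "FFV": Prediction(0.32, 0.01),
                   "TC": Prediction(0.40, 0.02), "Tg": Prediction(300.0, 20.0),
                   "Tm": Prediction(400.0, 30.0)},
        "cand_b": {"density": Prediction(1.3, 0.05), "FFV": Prediction(0.32, 0.01),
                   "TC": Prediction(0.40, 0.02), "Tg": Prediction(300.0, 20.0),
                   "Tm": Prediction(400.0, 30.0)},
    }
    print(screen(toy, THERMAL_CONSTRAINTS, k=1.0))
```

Setting `k=0` screens on the predicted means alone, while a larger `k` demands that the whole uncertainty interval fit inside each window, which is a more conservative choice. Synthesizability (retrosynthetic routes, SAScore, PolyScore) would be a separate ranking step applied to the candidates that pass this filter.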