polyRETRO: a Language Model Approach to predict Polymerization Class and Monomer(s) for a Target Polymer

Sakshi Agarwal, Wei Xiong, Rampi Ramprasad

TL;DR

polyRETRO tackles the challenge of translating ML-designed polymers into experimentally feasible routes by integrating large language models into a two-objective retrosynthetic framework. The method first predicts the polymerization class from a target SMILES string, then infers reaction templates and monomers to reconstruct viable synthesis paths; the templates are expressed in natural language to improve interpretability and generalization. The reported polymerization-class accuracy is $0.98$, template accuracy is strong for addition and condensation polymerization, and monomer-prediction accuracy is near $0.97$ for the best model, while ring-opening routes map directly from the repeat unit. This work provides a scalable, interpretable bridge between in silico polymer design and lab-scale synthesis, enabling faster experimental validation and broader exploration of synthetic polymer space.

Abstract

While machine learning has transformed polymer design by enabling rapid property prediction and candidate generation, translating these designs into experimentally realizable materials remains a critical challenge. Traditionally, the synthesis of target polymers has relied heavily on expert intuition and prior experience. The lack of automated retrosynthetic tools to assist chemists limits the rapid practical impact of data-driven polymer discovery. To expedite lab-scale validation and beyond, we present a retrosynthetic framework that leverages large language models (LLMs) to guide polymer synthesis. Our approach, which we call polyRETRO, involves two key steps: 1) predicting the most likely polymerization reaction class of a target polymer and 2) identifying the underlying chemical transformation templates and the corresponding monomers, using primarily natural-language-based constructs. This LLM-driven framework enables direct retrosynthetic analysis given just the target polymer SMILES string. polyRETRO constitutes an initial step towards a scalable, interpretable, and generalizable approach to bridge the gap between computational design and experimental synthesis.

Paper Structure

This paper contains 13 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The retrosynthesis workflow of the polyRETRO pipeline for predicting monomers corresponding to a target polymer. Objective 1 employs a classification language model to identify the polymerization class. Objective 2(a) uses a second language model to predict the reaction template, while Objective 2(b) determines the monomers through bond cleavage or cyclization steps.
  • Figure 2: (a) The workflow for classifying polymerization class using LLMs, (b) the accuracy of the LLMs and the machine learning models for the polymerization classification task, (c) the confusion matrix for the GPT fine-tuned model, and (d) the class-wise accuracy for each polymerization class for the GPT fine-tuned model.
  • Figure 3: (a) Template generation and (b) LLM finetuning workflow for the addition polymerization reaction. (c) Template generation and (d) LLM finetuning workflow for the condensation polymerization reaction.
  • Figure 4: (a) Template classification accuracy for addition polymerization by GPT and LLaMa. (b) Accuracy for template prediction per polymer class of addition polymers using GPT.
  • Figure 5: The workflow for Objective 2(b) of the polyRETRO pipeline illustrates the mapping of polymer SMILES to their corresponding monomer structures.
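
The control flow described in the Figure 1 caption can be sketched in Python as below. This is only an illustrative skeleton, not the authors' implementation: the function names (`run_pipeline`, `classify`, `predict_template`, `cleave`, `cyclize`), the `[*]` end-point notation for repeat units, the wording of the toy template, and the lookup tables that stand in for the fine-tuned language models are all assumptions made for this example. Treating ring-opening as a branch that skips template prediction is one reading of the statement that ring-opening routes map directly from the repeat unit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

RING_OPENING = "ring-opening"  # label string is illustrative


@dataclass
class RetroResult:
    polymer_smiles: str
    polymerization_class: str
    template: Optional[str]   # None when no template step is used
    monomers: List[str]


def run_pipeline(
    polymer_smiles: str,
    classify: Callable[[str], str],               # Objective 1 (hypothetical hook)
    predict_template: Callable[[str, str], str],  # Objective 2(a) (hypothetical hook)
    cleave: Callable[[str, str], List[str]],      # Objective 2(b), bond cleavage
    cyclize: Callable[[str], List[str]],          # Objective 2(b), cyclization
) -> RetroResult:
    """Chain the two objectives for a single target polymer."""
    cls = classify(polymer_smiles)
    if cls == RING_OPENING:
        # Assumed reading: the cyclic monomer comes straight from the repeat unit.
        return RetroResult(polymer_smiles, cls, None, cyclize(polymer_smiles))
    template = predict_template(polymer_smiles, cls)
    return RetroResult(polymer_smiles, cls, template, cleave(polymer_smiles, template))


# Toy stand-ins for the models: fixed lookups for two well-known polymers.
PE = "[*]CC[*]"             # polyethylene repeat unit
PCL = "[*]OCCCCCC(=O)[*]"   # polycaprolactone repeat unit
TOY_CLASS = {PE: "addition", PCL: RING_OPENING}
TOY_TEMPLATE = {PE: "open the C=C double bond to form the backbone"}
TOY_MONOMERS = {PE: ["C=C"], PCL: ["O=C1CCCCCO1"]}  # ethylene, caprolactone

pe = run_pipeline(
    PE,
    classify=TOY_CLASS.__getitem__,
    predict_template=lambda smi, cls: TOY_TEMPLATE[smi],
    cleave=lambda smi, tmpl: TOY_MONOMERS[smi],
    cyclize=TOY_MONOMERS.__getitem__,
)
pcl = run_pipeline(
    PCL,
    classify=TOY_CLASS.__getitem__,
    predict_template=lambda smi, cls: TOY_TEMPLATE[smi],
    cleave=lambda smi, tmpl: TOY_MONOMERS[smi],
    cyclize=TOY_MONOMERS.__getitem__,
)
```

Passing the predictors in as callables keeps the skeleton independent of any particular model or cheminformatics toolkit; in practice each hook would wrap a fine-tuned model call or a structure-editing routine.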