Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding

Shreyas Vinaya Sathyanarayana, Shah Rahil Kirankumar, Sharanabasava D. Hiremath, Bharath Ramsundar

TL;DR

This work introduces Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of Large Language Models in rigorous chemical logic. Case studies on complex natural products show that this grounding enables reliable, expert-level autonomy.

Abstract

Large Language Models (LLMs) have shown remarkable potential in scientific domains such as retrosynthesis, yet they often lack the fine-grained control needed to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule, a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of LLMs in rigorous chemical logic. Our approach combines automated rule-based reasoning, using a database of 55+ SMARTS patterns and 40+ characterized protecting groups, with the generative intuition of neural models. The system operates via a hybrid architecture: an "automatic mode," in which symbolic logic deterministically identifies and guards reactive sites, and a "human-in-the-loop mode," which integrates expert strategic constraints. Through "active state tracking," we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.
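
The "protection state linked to canonical atom maps" can be pictured as a small lookup structure. The following is a minimal sketch only, not the authors' implementation: the class names `Protection` and `ProtectionState`, the method names, and the atom map numbers in the example are all hypothetical, chosen for illustration. It records which atom map numbers are guarded by which protecting group and reports when a proposed disconnection would touch a guarded atom.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Protection:
    """One protected site (hypothetical record layout, not from the paper)."""
    atom_maps: Tuple[int, ...]   # atom map numbers covered by this protection
    functional_group: str        # e.g. "secondary alcohol"
    protecting_group: str        # e.g. "OEt"


@dataclass
class ProtectionState:
    """Persistent record of active protections, indexed by atom map number."""
    by_atom: Dict[int, Protection] = field(default_factory=dict)

    def protect(self, atom_maps: Iterable[int], functional_group: str,
                protecting_group: str) -> Protection:
        entry = Protection(tuple(atom_maps), functional_group, protecting_group)
        clash = [m for m in entry.atom_maps if m in self.by_atom]
        if clash:
            raise ValueError(f"atom maps already protected: {clash}")
        for m in entry.atom_maps:
            self.by_atom[m] = entry
        return entry

    def deprotect(self, atom_map: int) -> Optional[Protection]:
        """Remove the protection covering atom_map; return it, or None if absent."""
        entry = self.by_atom.get(atom_map)
        if entry is None:
            return None
        for m in entry.atom_maps:
            del self.by_atom[m]
        return entry

    def violations(self, changed_atom_maps: Iterable[int]) -> List[int]:
        """Protected atom maps that a proposed disconnection would modify."""
        return sorted(set(changed_atom_maps) & set(self.by_atom))


# Illustrative usage with made-up atom map numbers.
state = ProtectionState()
state.protect([7, 8], "secondary alcohol", "OEt")
print(state.violations([3, 7]))  # [7]
```

Keying the state by atom map number rather than by substructure match means the constraint still points at the same atoms after the molecule is rewritten by a disconnection, which is the property the abstract attributes to canonical atom maps.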

Paper Structure

This paper contains 15 sections, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The Protect$^*$ architecture. The system identifies potential reaction sites in the molecule that can be protected. It then suggests a list of protecting groups for each site. The chemist tags the reaction site that is to be protected. This is then tracked in the persistent protection state where the framework links the protecting group to specific atom map numbers. The atom-mapped SMILES along with an updated LLM prompt containing explicit constraints is then sent into the DeepRetro engine to obtain the retrosynthesis pathway.
  • Figure 2: Retrosynthetic strategy for Erythromycin B generated by Protect$^*$. The hydroxyl groups of the skeleton are protected with ethoxy (OEt) protecting groups (highlighted in red).
  • Figure 3: Steered disconnection strategy for Prostaglandin E2. Protecting the side-chain hydroxyl with a methoxy (OMe) protecting group (highlighted in red) guides the LLM away from disconnecting at that site.
  • Figure 4: Stereochemistry-preserving pathway for Quinine. The symbolic protection of the secondary alcohol with an ethoxy (OEt) group (highlighted in red) safeguards a critical stereocenter. The steered LLM successfully navigates the complex bicyclic structure to propose a valid disconnection at a different site, demonstrating the method's ability to enforce stereochemical constraints and discover viable pathways for stereochemically rich molecules.
  • Figure 5: The in-context prompt template used to guide the LLM. It provides the canonically atom-mapped SMILES and explicitly lists the protected sites by their atom map numbers and active protecting groups, instructing the model to treat them as inert.
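
The prompt described in the Figure 5 caption can be approximated as follows. This is a sketch under stated assumptions: the function name `build_prompt`, the exact wording of the instructions, and the example molecule are hypothetical and are not taken from the paper's template. Only the ingredients come from the caption: an atom-mapped SMILES, a list of protected sites by atom map number with their protecting groups, and an instruction to treat those sites as inert.

```python
from typing import Dict


def build_prompt(mapped_smiles: str, protected: Dict[int, str]) -> str:
    """Render an in-context prompt listing protected sites by atom map number.

    `protected` maps an atom map number to the name of its active protecting
    group. The wording below is illustrative, not the paper's template.
    """
    lines = [
        "Target molecule (atom-mapped SMILES):",
        mapped_smiles,
        "",
        "Protected sites (treat as inert; do not disconnect at these atoms):",
    ]
    if not protected:
        lines.append("- none")
    for atom_map in sorted(protected):
        lines.append(f"- atom map {atom_map}: protected with {protected[atom_map]}")
    lines.append("")
    lines.append(
        "Propose a single-step retrosynthetic disconnection that leaves "
        "every protected site unchanged."
    )
    return "\n".join(lines)


# Illustrative usage: atom-mapped ethanol with its hydroxyl oxygen (map 3) protected.
print(build_prompt("[CH3:1][CH2:2][OH:3]", {3: "OEt"}))
```

Listing sites by atom map number keeps the constraint unambiguous even when the molecule contains several chemically identical groups, such as the multiple hydroxyls on the Erythromycin B skeleton.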