AgentDrug: Utilizing Large Language Models in an Agentic Workflow for Zero-Shot Molecular Optimization
Khiem Le, Ting Hua, Nitesh V. Chawla
TL;DR
AgentDrug tackles zero-shot molecular optimization by embedding LLMs in an agentic, two-loop workflow that first ensures chemical validity via an inner loop that fixes ParseError-driven issues in SMILES, then uses a gradient-guided outer loop and generic feedback to steer toward objective improvements. A retrieved in-context exemplar $m_e$ from a curated database augments guidance, with similarity measured by the Tanimoto coefficient, enabling targeted refinements while maintaining structural similarity to the input molecule $m$. Empirically, AgentDrug yields substantial accuracy gains over ChatDrug and baselines across single- and multi-property tasks, with larger LLMs delivering larger gains, and the inner loop significantly reducing molecular hallucination to improve retrieval reliability. The results demonstrate the practical viability of gradient-guided LLM refinement in chemical space and highlight the importance of early validity checks for robust retrieval-augmented optimization, albeit with acknowledged prompting costs and opportunities for more actionable feedback beyond gradients.
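To make the retrieval step concrete, here is a minimal sketch of Tanimoto-based exemplar lookup. It assumes each molecule's fingerprint is already available as a set of "on" bit indices (in practice a cheminformatics toolkit would produce these). The names `tanimoto` and `retrieve_exemplar` and the toy database are hypothetical illustrations, not the authors' implementation. Any additional filtering the method may apply to candidates is not shown.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Assumption: a fingerprint is the set of bit indices that are "on".
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two bit sets."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: define similarity as 0.0 to avoid 0/0.
        return 0.0
    return len(a & b) / union


def retrieve_exemplar(
    query: Fingerprint, database: Dict[str, Fingerprint]
) -> Optional[Tuple[str, float]]:
    """Return (SMILES, similarity) of the entry most similar to the query."""
    best: Optional[Tuple[str, float]] = None
    for smiles, fingerprint in database.items():
        score = tanimoto(query, fingerprint)
        if best is None or score > best[1]:
            best = (smiles, score)
    return best


# Toy database with made-up fingerprints (hypothetical values).
toy_db = {
    "CCO": frozenset({1, 2, 3}),
    "CCN": frozenset({2, 3, 4, 5}),
    "c1ccccc1": frozenset({7, 8, 9}),
}
query_fp = frozenset({1, 2, 3, 4})
exemplar = retrieve_exemplar(query_fp, toy_db)
```

With these toy values, `CCO` and `CCN` share three and three bits with the query respectively, but `CCO` has the smaller union, so it is selected with a similarity of 0.75.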
Abstract
Molecular optimization, that is, modifying a given molecule to improve desired properties, is a fundamental task in drug discovery. While LLMs have the potential to solve this task by using natural language to drive the optimization, straightforward prompting achieves limited accuracy. In this work, we propose AgentDrug, an agentic workflow that leverages LLMs in a structured refinement process to achieve significantly higher accuracy. AgentDrug defines a nested refinement loop: the inner loop uses feedback from cheminformatics toolkits to validate molecular structures, while the outer loop guides the LLM with generic feedback and a gradient-based objective to steer the molecule toward property improvement. We evaluate AgentDrug on benchmarks with both single- and multi-property optimization under loose and strict thresholds. Results demonstrate significant performance gains over previous methods. With Qwen-2.5-3B, AgentDrug improves accuracy by 20.7\% (loose) and 16.8\% (strict) on six single-property tasks, and by 7.0\% and 5.3\%, respectively, on eight multi-property tasks. With the larger Qwen-2.5-7B, AgentDrug improves accuracy on the six single-property tasks by 28.9\% (loose) and 29.0\% (strict), and on the eight multi-property tasks by 14.9\% (loose) and 13.2\% (strict).
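The nested refinement loop described above can be sketched as a small control-flow skeleton. This is only an illustration under stated assumptions: the callables `propose` (standing in for an LLM call), `validate` (standing in for a cheminformatics toolkit that returns an error message or `None`), and `meets_objective` are hypothetical, the feedback strings are invented, and the gradient-based guidance and exemplar retrieval are omitted.

```python
from typing import Callable, Optional


def nested_refine(
    molecule: str,
    propose: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    meets_objective: Callable[[str, str], bool],
    max_outer: int = 3,
    max_inner: int = 3,
) -> Optional[str]:
    """Return a valid candidate that meets the objective, or None."""
    feedback: Optional[str] = None
    for _ in range(max_outer):
        candidate = propose(molecule, feedback)

        # Inner loop: repair the candidate until the validator accepts it
        # or the retry budget is exhausted.
        error = validate(candidate)
        attempts = 0
        while error is not None and attempts < max_inner:
            candidate = propose(molecule, f"Invalid structure: {error}")
            error = validate(candidate)
            attempts += 1
        if error is not None:
            feedback = "No valid structure was produced."
            continue

        # Outer loop: check the property objective on the valid candidate.
        if meets_objective(molecule, candidate):
            return candidate
        feedback = f"{candidate} is valid but does not meet the objective."
    return None


# Stubbed usage: a scripted "LLM", a toy validator, and a toy objective.
scripted = iter(["C((", "CCO", "CCN"])
result = nested_refine(
    "CC",
    propose=lambda mol, fb: next(scripted),
    validate=lambda s: None if s.count("(") == s.count(")") else "unbalanced parentheses",
    meets_objective=lambda mol, cand: cand.endswith("N"),
)
```

In this stubbed run, the first proposal fails validation and is repaired in the inner loop, the repaired candidate misses the toy objective, and the second outer iteration returns `CCN`.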
