Beyond SMILES: Evaluating Agentic Systems for Drug Discovery

Edward Wijaya

TL;DR

This work examines whether agentic drug-discovery systems generalize beyond small-molecule workflows by evaluating six frameworks across fifteen practitioner-derived task classes. It identifies five core architectural gaps: small-molecule representation bias, absence of in vivo/in silico integration, limited computational paradigms, misalignment with small-biotech constraints, and single-objective optimization. A knowledge-probing study shows that frontier LLMs possess peptide reasoning capabilities that current agents fail to surface because of architectural limitations, which points to a need for integration pipelines rather than model retraining. The authors derive five design requirements for next-generation frameworks that act as computational partners: multi-paradigm orchestration, modality-aware representations, in vivo data fusion, data-efficient learning, and risk-aware multi-objective optimization. These findings provide a roadmap for building agentic systems that can operate under realistic constraints and support iterative design-test cycles in diverse drug discovery contexts.

Abstract

Agentic systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. We ask how well they generalize. Evaluating six frameworks against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource-constrained settings, we find five capability gaps: no support for protein language models or peptide-specific prediction, no bridges between in vivo and in silico data, reliance on LLM inference with no pathway to ML training or reinforcement learning, assumptions tied to large-pharma resources, and single-objective optimization that ignores safety-efficacy-stability trade-offs. A paired knowledge-probing experiment suggests the bottleneck is architectural rather than epistemic: four frontier LLMs reason about peptides at levels comparable to small molecules, yet no framework exposes this capability. We propose design requirements and a capability matrix for next-generation frameworks that function as computational partners under realistic constraints.

Paper Structure (68 sections, 8 figures, 17 tables)

Introduction
The Current Landscape
Scope and Motivation
Evaluation Framework
Agent Framework Selection
Task Class Definition
Evaluation Dimensions
Analysis Approach
LLM Knowledge Probing
Results: Five Capability Gaps
Gap 1: Small-Molecule Representation Bias
Findings
Stranded Knowledge: A Diagnostic Experiment
The Peptide-Specific Challenge Space
Protein Language Models vs Molecular Fingerprints
...and 53 more sections

Figures (8)

Figure 1: The Agent Reality Gap in Drug Discovery. Left panel shows computational workflows where current agents excel: small molecule representations (SMILES strings), databases, literature mining, and virtual screening. Right panel depicts the messy reality of drug discovery: multi-modal biological data from animal studies, wet lab iteration, and multi-objective trade-offs. The gap between these contexts represents the architectural limitations addressed in this paper.
Figure 2: Stranded Knowledge: LLM Peptide Competence vs Agent Capability. Mean scores (0-3 scale) for four frontier LLMs across 50 matched question pairs spanning five pharmaceutical knowledge categories. Blue bars: small-molecule questions; orange bars: peptide questions. Error bars: 95% bootstrap confidence intervals. All four models demonstrate competent peptide reasoning at or above small-molecule levels (aggregate gap = $-0.115$, 95% CI: $[-0.255, 0.02]$, all Bonferroni-adjusted $p = 1.0$). This knowledge is stranded: no current agentic framework surfaces it through peptide-aware tools.
Figure 3: Workflow Complexity: Small Molecules vs Peptides. Top: Small molecule workflow follows a linear path from SMILES representation through RDKit property calculation to docking. Bottom: Peptide workflow branches into multiple parallel analysis streams including structural prediction, aggregation propensity, stability, immunogenicity, membrane permeability, and protease resistance, requiring integration of diverse computational tools and protein language models.
Figure 4: What Agents Can and Cannot Process. Data types grouped by agent accessibility into three tiers. Green checkmarks indicate natively supported formats (SMILES strings, literature abstracts, PDB structures, CSV assay data). Yellow half-circles denote partial support requiring heavy preprocessing (tissue imaging, clinical trajectories, clinical notes). Red X marks denote data types with no current agent support (behavioral videos, RNA-seq data). Most in vivo data modalities fall in the partial or inaccessible tiers, revealing the systematic exclusion of biological validation data from current agent architectures.
Figure 5: From LLM-Centric to Multi-Paradigm Orchestration. Top: Current LLM-centric architecture where a central language model orchestrates all tools through API calls. Bottom: Multi-paradigm architecture addressing the identified orchestration gap, where a coordinator manages fundamentally different computational paradigms (ML training pipelines, RL optimization loops, PLM fine-tuning, CV analysis, physics simulations) that execute independently with results aggregated for decision-making.
...and 3 more figures
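
The statistics quoted in the Figure 2 caption (a paired gap with a 95% bootstrap confidence interval and Bonferroni-adjusted p-values) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the percentile bootstrap variant, the resample count, and the sign convention (gap = small-molecule score minus peptide score, inferred from the negative aggregate gap) are all assumptions, and the scores in the usage lines are made up for illustration.

```python
import random


def paired_bootstrap_ci(small_mol, peptide, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired gap.

    Assumed convention: gap = small-molecule score - peptide score, so a
    negative gap means the peptide questions scored higher on average.
    """
    if len(small_mol) != len(peptide):
        raise ValueError("scores must come from matched question pairs")
    diffs = [s - p for s, p in zip(small_mol, peptide)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Illustrative, made-up rubric scores on a 0-3 scale (not data from the paper).
sm_scores = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2]
pep_scores = [3, 3, 2, 2, 3, 2, 1, 3, 2, 2]
gap, ci_lo, ci_hi = paired_bootstrap_ci(sm_scores, pep_scores, n_resamples=2000)
print(f"gap={gap:.3f}, 95% CI=[{ci_lo:.3f}, {ci_hi:.3f}]")
print(bonferroni([0.30, 0.50, 0.01, 0.90]))
```

Because the adjustment caps at 1.0, any raw p-value at or above 1/m (0.25 for four models) is reported as 1.0, which is consistent with how a caption can list every adjusted p-value as exactly 1.0. An interval that contains zero, as the quoted one does, indicates no detectable difference between the two question types at that confidence level.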
