MDCrow: Automating Molecular Dynamics Workflows with Large Language Models

Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White

TL;DR

MDCrow is an agentic LLM framework that automates molecular dynamics (MD) workflows by combining domain-specific tools for information retrieval, PDB handling, simulation setup, and analysis. Using chain-of-thought reasoning and tool use within a LangChain/ReAct paradigm, MDCrow is evaluated for task completion and robustness across 25 prompts, with gpt-4o and llama3-405b performing best. The study shows that MDCrow outperforms baselines and can extrapolate to tasks outside its explicit toolset through interactive chat, a step toward scalable, automated MD research pipelines. The work provides open-source code and emphasizes careful evaluation and potential future enhancements as LLM capabilities advance.

Abstract

Molecular dynamics (MD) simulations are essential for understanding biomolecular systems but remain challenging to automate. Recent advances in large language models (LLMs) have demonstrated success in automating complex scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an agentic LLM assistant capable of automating MD workflows. MDCrow uses chain-of-thought over 40 expert-designed tools for handling and processing files, setting up simulations, analyzing the simulation outputs, and retrieving relevant information from literature and databases. We assess MDCrow's performance across 25 tasks of varying required subtasks and difficulty, and we evaluate the agent's robustness to both difficulty and prompt style. gpt-4o is able to complete complex tasks with low variance, followed closely by llama3-405b, a compelling open-source model. While prompt style does not influence the best models' performance, it has significant effects on smaller models.
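The tool-use loop described in the abstract can be illustrated with a minimal sketch. This is not MDCrow's code and does not use LangChain: the tool names (`download_pdb`, `count_residues`), the PDB ID `1ABC`, and the scripted policy that stands in for the LLM are all hypothetical, chosen only to show the act/observe cycle in which an agent picks a tool, reads the result, and decides the next step.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)

# Hypothetical stand-in tools; real MD tools would call external software.
def download_pdb(pdb_id: str) -> str:
    return f"{pdb_id}.pdb"

def count_residues(path: str) -> str:
    return f"residues counted in {path}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "download_pdb": download_pdb,
    "count_residues": count_residues,
}

def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM: chooses the next action from the step count."""
    plan = [("download_pdb", "1ABC"), ("count_residues", "1ABC.pdb")]
    if len(history) < len(plan):
        return plan[len(history)]
    return ("final", "done")

def run_agent(policy, tools, max_steps: int = 10):
    """Alternate between choosing an action and observing its result."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = policy(history)
        if action == "final":
            return arg, history
        if action not in tools:
            # Feed the error back so the policy can try to recover.
            observation = f"error: unknown tool {action}"
        else:
            observation = tools[action](arg)
        history.append((action, arg, observation))
    return "stopped: step limit reached", history

answer, history = run_agent(scripted_policy, TOOLS)
for step in history:
    print(step)
print(answer)
```

In a real agent the policy would be an LLM call that receives the history as context, which is what lets the agent react to errors instead of following a fixed plan.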

Paper Structure

This paper contains 19 sections, 7 figures.

Figures (7)

  • Figure 1: A. MDCrow workflow. Starting with a user prompt and initialized with a set of MD tools, MDCrow follows a chain-of-thought process until it completes all tasks in the prompt. The final output includes a response, along with all resulting analyses and files. B. The tool distribution categorized into 4 types: information retrieval, PDB and protein handling, simulation, and analysis. A few examples from each category are shown. C. Two example prompts that MDCrow is tested on. The first is the simplest prompt, containing only 1 subtask. The most complex task requires 10 subtasks. D. Average subtask completion across all 25 prompts as task complexity (the number of subtasks per prompt) increases. The top three performing base LLMs are shown. Among them, gpt-4o and llama3-405b remain stable, staying close to 100% completion even as task complexity increases.
  • Figure 2: Example chat with MDCrow. The user first asks to download PDB files for two systems. Once MDCrow has completed this task, the user asks for analysis of the files. Next, the user asks for a quick 10 ps simulation of both files, and MDCrow saves all files for later handling. Lastly, the user asks for plots of RMSD over time for each simulation, and MDCrow responds with each plot.
  • Figure 3: MDCrow Performance across Large Language Models. A. Summary of MDCrow performance by base LLM. Accuracy is the percentage of prompts for which the agent gave an acceptable final answer. While statistically indistinguishable from the Claude and Llama models, gpt-4o significantly outperforms the other GPT models in giving accurate solutions (t-test, $0.004 \le$ p-value $\le 0.046$). B. The distribution of the number of subtasks per task across the 25 prompts. The prompts range from 1 to 10 subtasks, with each subtask count represented by at least 2 prompts. C. Percentage of prompts with accurate solutions, by LLM used and number of subtasks per task. The correlation between accuracy and complexity is statistically significant for all LLMs (Spearman correlation, $3.9\times10^{-7} \le$ p-value $\le 1.1\times10^{-2}$). D. Percentage of subtasks that the agent completed per task, for each base LLM.
  • Figure 4: A. The number of subtasks in each task, categorized by type. Task 1 begins with a single pre-simulation subtask (Download a PDB file) and each subsequent task adds a single subtask, for a total of 10 tasks with a maximum of 10 subtasks. B. Example of the "Natural" and "Ordered" prompt styles on a three-step prompt. C. The robustness of MDCrow built on each model with both prompt types, measured by the coefficient of variation (CV). A lower CV is interpreted as greater consistency. gpt-4o and llama3-405b are the most robust models; the Claude models have higher CVs. D. Comparison of subtask completion across models and prompt types. In the 9-subtask prompt, gpt-4o encountered an error after only one step and gave up without trying to fix it. In general, gpt-4o and llama3-405b perform robustly with increasing complexity for both prompt types. claude-3-opus struggles with more complex tasks, making more logical errors on them. The two claude-3.5-sonnet models performed fairly poorly across this experiment.
  • Figure 5: Performance across LLM Frameworks using the same 25-prompt set: MDCrow, direct LLM with no tools (single-query), and ReAct agent with only a Python REPL tool. All use gpt-4o. A. Performance of each LLM framework, measured by accuracy and by the average percentage of subtasks completed for each of the 25 task prompts. MDCrow is significantly better at giving accurate solutions than direct LLM (t-test, $p=1\times10^{-3}$) and ReAct (t-test, $p=4\times10^{-4}$). MDCrow completes significantly more subtasks on average than direct LLM (t-test, $p=1\times10^{-6}$) and ReAct (t-test, $p=6\times10^{-6}$). B. Percentage of tasks completed with respect to the LLM framework used and the number of subtasks required for each task. The correlation between accuracy and number of subtasks required is statistically significant for direct LLM ($p=1\times10^{-3}$) and for MDCrow ($p=1\times10^{-4}$). For ReAct, $p=7\times10^{-2}$.
  • ...and 2 more figures
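
The coefficient of variation (CV) used as the robustness measure in Figure 4C is the standard deviation divided by the mean. The sketch below shows the calculation on made-up completion fractions, not data from the paper. It assumes the population standard deviation; the caption does not say whether the authors used the population or sample form.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Return std / mean; lower values indicate more consistent results."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: population standard deviation (pstdev), not sample (stdev).
    return pstdev(values) / m

# Hypothetical subtask-completion fractions for two models (illustrative only).
steady_model = [1.0, 0.9, 1.0, 0.95, 1.0]
erratic_model = [1.0, 0.2, 0.9, 0.1, 0.8]

print(round(coefficient_of_variation(steady_model), 3))
print(round(coefficient_of_variation(erratic_model), 3))
```

Because the CV is normalized by the mean, it can be compared across models with different average completion rates, which is presumably why it suits a robustness comparison.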