ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning

A. Ghafarollahi; M. J. Buehler

TL;DR

ProtAgents presents a GPT-4–powered multi-agent framework that orchestrates knowledge retrieval, physics-based simulations, and de novo protein design to enable autonomous, multi-objective protein discovery. By integrating Chroma for design, OmegaFold for folding, DSSP and ANM analyses for structure and dynamics, and ForceGPT for mechanical properties, the approach demonstrates end-to-end automation across three experiments, including CATH-conditioned design. The work highlights the potential of AI-driven agent collaboration to reduce human intervention while leveraging physics data and literature retrieval to explore vast design spaces. This platform paves the way for autonomous materials discovery and design, with implications for rapid exploration of sequence-structure-property relationships and multi-domain integration in protein engineering.

Abstract

Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, which limits their flexibility when out-of-domain knowledge must be incorporated into the design process or when comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures, and obtaining new first-principles data (natural vibrational frequencies) via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on the one hand, and their capacity for autonomous collaboration through the dynamic LLM-based multi-agent environment, on the other, unlocks the potential of LLMs for addressing multi-objective materials problems and opens up new avenues for autonomous materials discovery and design.

Paper Structure (7 sections, 6 figures, 4 tables)

This paper contains 7 sections, 6 figures, 4 tables.

Introduction
Results and Discussion
Experiment I: Knowledge retrieval, computations, and analysis
Experiment II: De novo protein design using Chroma
Experiment III: Protein design conditioned on the protein CATH class
Conclusions
Materials and Methods

Figures (6)

Figure 1: Multi-agent AI framework for automating protein discovery and analysis. a, A generic agent structure in a multi-agent modeling environment that can communicate via language, has a focus defined by a profile, and has access to custom functions. b, A function is customized by a profile and a set of parameters. c, The structure of a team of agents, each with special expertise, that communicate with each other and allow for mutual correction and a division of labor. Each agent is given a different profile, which makes it an expert in describing the problem (user_proxy), making plans (planner), executing functions (assistant), or evaluating results (critic). The whole process is automated via a dynamic group chat led by the chat manager, offering a versatile approach to solving challenging tasks in protein design and analysis without human intervention.
Figure 2: A generic flowchart showing the dynamic interaction between the multi-agent team members organized by the group chat manager to solve protein design and analysis problems. The manager selects the working agents to collaborate in the team work based on the current context of the chat, thus forming close interactions and enabling mutual corrections.
Figure 3: Overview of the multi-agent work to solve the complex task posed in Experiment II. First, the multi-agent team uses Chroma to generate de novo protein sequences and then computes natural frequencies and secondary structure content for the generated structures. Next, from the de novo AA sequences, the model finds the 3D folded structures using OmegaFold and finally computes the frequencies and secondary structure content for these protein structures. The results obtained from the Chroma and OmegaFold 3D protein structures are compared in Figure 5.
Figure 4: Overview of the multi-agent work to solve the complex task posed in Experiment III. First, the multi-agent team uses Chroma to generate de novo protein sequences and structures conditioned on the input CATH class. Then, using the generated protein structures, the natural frequencies and secondary structure content are computed. Next, the force (maximum force along the unfolding force-extension curve) and energy (the area under the force-extension curve) are computed from the de novo AA sequences using ProteinForceGPT.
Figure 5: The results generated by the multi-agent collaboration for Experiment II. The first and second columns depict the 3D folded structures of proteins generated by Chroma and OmegaFold2, respectively, while the third and fourth columns show the fractional content of secondary structures and the first ten natural frequencies for the generated proteins.
...and 1 more figures
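
The Figure 4 caption defines two mechanical quantities: the force is the maximum force along the unfolding force-extension curve, and the energy is the area under that curve. The minimal Python sketch below only illustrates these two definitions on a made-up curve. It is not how ProteinForceGPT works (per the caption, that model computes the values from the amino acid sequence), and the function name `force_and_energy`, the trapezoidal integration, and the toy data are assumptions introduced here for illustration.

```python
from typing import Sequence, Tuple


def force_and_energy(
    extension: Sequence[float], force: Sequence[float]
) -> Tuple[float, float]:
    """Return (maximum force, area under the force-extension curve).

    Hypothetical helper: the area is approximated with the trapezoidal
    rule, which is an assumption of this sketch, not a detail from the paper.
    """
    if len(extension) != len(force) or len(extension) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    # "Force": the peak value along the unfolding curve.
    max_force = max(force)

    # "Energy": integrate force over extension, segment by segment.
    energy = 0.0
    for i in range(1, len(extension)):
        dx = extension[i] - extension[i - 1]
        energy += 0.5 * (force[i] + force[i - 1]) * dx

    return max_force, energy


# Toy force-extension curve in arbitrary units (illustrative, not paper data).
ext = [0.0, 1.0, 2.0, 3.0]
frc = [0.0, 2.0, 4.0, 1.0]
peak, area = force_and_energy(ext, frc)
print(peak, area)
```

On a real curve the extension and force would carry physical units, and the energy would then be in the product of those units.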
