Swarms of Large Language Model Agents for Protein Sequence Design with Experimental Validation

Fiona Y. Wang, Di Sheng Lee, David L. Kaplan, Markus J. Buehler

TL;DR

The paper tackles the challenge of de novo protein design by introducing a decentralized swarm of LLM agents, each responsible for a residue position, that iteratively propose mutations guided by objectives, memory, and local context. This training-free framework achieves objective-directed designs across structural motifs, physicochemical properties, and multi-domain functions, validated experimentally for secondary structure content and with comprehensive in silico metrics. Key contributions include a four-phase design loop, memory-enabled learning, and comparative analyses showing tunable search dynamics across LLMs, along with efficient inference that requires no fine-tuning. The approach demonstrates robust design versatility and computational efficiency, offering a generalizable paradigm for biomolecular design beyond proteins.

Abstract

Designing proteins de novo with tailored structural, physicochemical, and functional properties remains a grand challenge in biotechnology, medicine, and materials science, due to the vastness of sequence space and the complex coupling between sequence, structure, and function. Current state-of-the-art generative methods, such as protein language models (PLMs) and diffusion-based architectures, often require extensive fine-tuning, task-specific data, or model reconfiguration to support objective-directed design, thereby limiting their flexibility and scalability. To overcome these limitations, we present a decentralized, agent-based framework inspired by swarm intelligence for de novo protein design. In this approach, multiple large language model (LLM) agents operate in parallel, each assigned to a specific residue position. These agents iteratively propose context-aware mutations by integrating design objectives, local neighborhood interactions, and memory and feedback from previous iterations. This position-wise, decentralized coordination enables emergent design of diverse, well-defined sequences without reliance on motif scaffolds or multiple sequence alignments, validated with experiments on proteins with alpha helix and coil structures. Through analyses of residue conservation, structure-based metrics, and sequence convergence and embeddings, we demonstrate that the framework exhibits emergent behaviors and effective navigation of the protein fitness landscape. Our method achieves efficient, objective-directed designs within a few GPU-hours and operates entirely without fine-tuning or specialized training, offering a generalizable and adaptable solution for protein design. Beyond proteins, the approach lays the groundwork for collective LLM-driven design across biomolecular systems and other scientific discovery tasks.
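The position-wise loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: `propose_mutation` is a deterministic stub standing in for an LLM call, `score` is a toy objective (fraction of residues in a hypothetical helix-favoring set) standing in for the paper's evaluation, and the names `run_swarm`, `HELIX_FORMERS`, and `window` are invented.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical helix-favoring residues; a toy stand-in, not the paper's scoring.
HELIX_FORMERS = set("AELMQKR")


def score(sequence):
    """Toy objective: fraction of residues in the helix-favoring set."""
    return sum(res in HELIX_FORMERS for res in sequence) / len(sequence)


def propose_mutation(position, sequence, memory, rng, window=2):
    """Stub for one agent's LLM call: pick a residue for `position`.

    The agent sees its local neighborhood, the shared memory of past
    sequences, and the objective (through `score`).
    """
    left = max(0, position - window)
    neighborhood = sequence[left:position] + sequence[position + 1:position + 1 + window]

    # Candidates: keep the current residue, try a random one, copy a neighbor,
    # or reuse the residue from the best sequence remembered so far.
    candidates = [sequence[position], rng.choice(AMINO_ACIDS)]
    candidates.extend(neighborhood)
    best_seq, _ = max(memory, key=lambda item: item[1])
    candidates.append(best_seq[position])

    def trial(residue):
        mutated = sequence[:position] + residue + sequence[position + 1:]
        return score(mutated)

    # max() returns the first best candidate, so ties keep the current residue.
    return max(candidates, key=trial)


def run_swarm(start, iterations=16, seed=0):
    """Run the propose / assemble / evaluate / remember loop; return the best (sequence, score)."""
    rng = random.Random(seed)
    sequence = start
    memory = [(sequence, score(sequence))]
    for _ in range(iterations):
        # Every agent reads the same snapshot, mimicking parallel proposals.
        proposals = [
            propose_mutation(i, sequence, memory, rng)
            for i in range(len(sequence))
        ]
        sequence = "".join(proposals)
        memory.append((sequence, score(sequence)))
    return max(memory, key=lambda item: item[1])


if __name__ == "__main__":
    best, best_score = run_swarm("GSPGSPGSPGSPGSPG")
    print(best, round(best_score, 2))
```

In a real setting the stub would be replaced by a prompt to an LLM and the toy score by a structure- or energy-based evaluation; the sketch only shows how per-position agents, a shared snapshot, and a memory of past sequences fit together.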

Paper Structure

This paper contains 12 sections, 8 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: a. Comparison of swarm framework with conventional single-model framework. b. Multiple LLM agents share the same design objective and reasoning hubs, each proposing mutations for a single residue position, producing updated local context and evaluation feedback. c. Starting with a design objective and input sequence, the agents propose mutations for each residue position, producing an updated sequence which is evaluated. The previously proposed sequences are stored in memory for future iterations. The local context, memory history, and evaluation feedback are used to guide the next round of mutations. d. Input prompt consists of the agent's role and task, local neighborhood and context, design goal and energy, and memory history. Output consists of reasoning and the proposed mutation.
  • Figure 2: Design objective, start sequence, best sequence, its respective 3D structure, and sequence logo returned from 64 iterations with GPT-4o for four structural design objectives.
  • Figure 3: CD spectra of the best sequence for the a. hydrophilic helix and the b. coil sequence.
  • Figure 4: Evolution of calculated Rosetta energy (red) and Structure Score (blue) over 64 iterations with GPT-4o for the design objective: choose residues that mirror their left and right neighbors to promote local symmetry. This plot visualizes the dynamic interplay between convergence (blue shaded regions, where Structure Score stabilizes at a high level and Energy stabilizes at a low level) and exploration (orange shaded regions, where Energy fluctuates more significantly) during the iterative design process.
  • Figure 5: Design objective, start sequence, best sequence, evidence of objective achievement, and sequence logo returned from 16 iterations with GPT-4o for three diverse design objectives.
  • ...and 9 more figures
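
The caption of Figure 1d lists what each agent's input prompt contains: the agent's role and task, the local neighborhood and context, the design goal and energy, and the memory history. A minimal sketch of assembling such a prompt is shown here. The wording of the prompt, the function name `build_agent_prompt`, the `window` parameter, and the example energy value are all hypothetical; the paper's actual prompt text is not reproduced in this summary.

```python
def build_agent_prompt(position, sequence, objective, energy, memory, window=3):
    """Assemble a text prompt for the agent at a 0-based `position`.

    `memory` is a list of (sequence, energy) pairs from earlier iterations.
    All wording below is illustrative, not the authors' prompt.
    """
    left = sequence[max(0, position - window):position]
    right = sequence[position + 1:position + 1 + window]

    # Show only the most recent entries to keep the prompt short.
    history = "\n".join(
        f"- {seq} (energy {e:.2f})" for seq, e in memory[-3:]
    ) or "- none yet"

    return "\n".join([
        # Role and task
        f"You are the agent for residue position {position} in a protein sequence.",
        "Propose one amino acid for your position and explain your reasoning.",
        # Local neighborhood and context
        f"current residue: {sequence[position]}",
        f"left neighbors: {left or '(none)'}",
        f"right neighbors: {right or '(none)'}",
        # Design goal and energy
        f"design objective: {objective}",
        f"current energy: {energy:.2f}",
        # Memory history
        "previous sequences:",
        history,
    ])


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(build_agent_prompt(
        position=2,
        sequence="MKTAYIAK",
        objective="maximize alpha-helix content",
        energy=-12.5,
        memory=[("MKTAYIAK", -12.5)],
    ))
```

Per the same caption, the model's output would then carry the reasoning and the proposed mutation, which a caller would parse before assembling the next sequence.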