Adaptive Protein Design Protocols and Middleware

Aymen Alsaadi, Jonathan Ash, Mikhail Titov, Matteo Turilli, Andre Merzky, Shantenu Jha, Sagar Khare

TL;DR

The paper tackles the challenge of efficiently navigating the vast protein design space by introducing IMPRESS, a framework that tightly couples AI-driven sequence generation with HPC simulations in real time. It implements an adaptive design protocol using ProteinMPNN for sequence generation and AlphaFold for structure prediction, orchestrated by RADICAL-Pilot to enable asynchronous, workload-aware execution. Key contributions include an adaptive pipeline that improves design quality as measured by metrics such as $pLDDT$, $pTM$, and $pAE$, and markedly higher resource utilization compared to non-adaptive baselines. The approach generalizes beyond a single use case and lays the groundwork for scalable AI-HPC workflows, potentially extending to proteases and foundation-model evaluation in protein design.

Abstract

Computational protein design is experiencing a transformation driven by AI/ML. However, the range of potential protein sequences and structures is astronomically vast, even for moderately sized proteins. Hence, achieving convergence between generated and predicted structures demands substantial computational resources for sampling. The Integrated Machine-learning for Protein Structures at Scale (IMPRESS) offers methods and advanced computing systems for coupling AI to high-performance computing tasks, making it possible to evaluate protein designs as they are developed, as well as the models and simulations used to generate data and train models. This paper introduces IMPRESS and demonstrates the development and implementation of an adaptive protein design protocol and its supporting computing infrastructure. The protocol yields more consistent protein design quality and higher design throughput, owing to dynamic resource allocation and asynchronous workload execution.
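To make the idea of an adaptive, asynchronous design loop concrete, the following is a minimal sketch and not the IMPRESS implementation. It does not call ProteinMPNN, AlphaFold, or RADICAL-Pilot. The functions `toy_score`, `propose_sequences`, `predict`, `adaptive_pipeline`, and `run_all` are all hypothetical stand-ins, the score is a deterministic hash rather than a real confidence metric, and the stopping rule (stop when no candidate improves the score) is an assumption chosen for illustration.

```python
import asyncio
import hashlib

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(sequence: str) -> float:
    """Hypothetical stand-in for a prediction confidence score in [0, 100]."""
    digest = hashlib.sha256(sequence.encode()).digest()
    return digest[0] / 255 * 100


def propose_sequences(parent: str, n: int) -> list[str]:
    """Hypothetical stand-in for a sequence generator: n point variants."""
    variants = []
    for i in range(n):
        pos = i % len(parent)
        shift = i + 1
        residue = ALPHABET[(ALPHABET.index(parent[pos]) + shift) % len(ALPHABET)]
        variants.append(parent[:pos] + residue + parent[pos + 1:])
    return variants


async def predict(sequence: str) -> float:
    """Hypothetical asynchronous prediction task."""
    await asyncio.sleep(0)  # yield control so other pipelines can progress
    return toy_score(sequence)


async def adaptive_pipeline(seed: str, max_iters: int = 5,
                            n_candidates: int = 4) -> tuple[str, float, int]:
    """Iterate generate -> predict, keeping a candidate only if it improves."""
    best_seq, best_score = seed, await predict(seed)
    iters = 0
    for _ in range(max_iters):
        iters += 1
        candidates = propose_sequences(best_seq, n_candidates)
        # Score all candidates of this iteration concurrently.
        scores = await asyncio.gather(*(predict(c) for c in candidates))
        top_score, top_seq = max(zip(scores, candidates))
        if top_score <= best_score:
            break  # assumed adaptive rule: stop when nothing improves
        best_seq, best_score = top_seq, top_score
    return best_seq, best_score, iters


async def run_all(seeds: list[str]) -> list[tuple[str, float, int]]:
    """Run one independent pipeline per seed, interleaved on one event loop."""
    return await asyncio.gather(*(adaptive_pipeline(s) for s in seeds))


if __name__ == "__main__":
    for seq, score, iters in asyncio.run(run_all(["ACDEFG", "KLMNPQ"])):
        print(seq, round(score, 1), iters)
```

Because each pipeline decides for itself when to stop, pipelines finish at different times. In a real deployment this is where a workload manager could reassign freed resources to pipelines that are still running, which is the behavior the abstract attributes to dynamic resource allocation.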

Paper Structure

This paper contains 11 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: IMPRESS design and execution sequence
  • Figure 2: Comparison of AlphaFold pLDDT (Left; higher is better), pTM (Center; higher is better), and Interchain pAE (Right; lower is better) between CONT-V and IM-RP pipelines. Bars show median values for each metric across 4 PDZ-peptide structures, with CONT-V in red and IM-RP in green. Error bars represent half a standard deviation.
  • Figure 3: Achieved AlphaFold pLDDT (Left; higher is better), pTM (Center; higher is better), and Interchain pAE (Right; lower is better) by the expanded IM-RP workflow. Bars show median values for each metric across 70 PDZ-peptide structures. Error bars represent half a standard deviation.
  • Figure 4: CONT-V total GPU/CPU resource utilization and execution time.
  • Figure 5: IM-RP total GPU/CPU utilization and execution time. Bootstrap: RP startup time. Exec setup: time for RP to prepare task execution (including script creation and sandbox setup; time varies depending on the file system). Running: task execution time on assigned resources.
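The captions for Figures 2 and 3 describe bars as medians with error bars of half a standard deviation. A minimal sketch of that summary is given below. The helper name `bar_and_error` is hypothetical, the input values are made up and not taken from the paper, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the captions do not say which estimator was used.

```python
import statistics


def bar_and_error(values: list[float]) -> tuple[float, float]:
    """Return (bar height, error-bar half-length) for one metric.

    Bar height is the median; the error bar is half the sample
    standard deviation (assumed estimator, n - 1 denominator).
    """
    median = statistics.median(values)
    half_sd = 0.5 * statistics.stdev(values)
    return median, half_sd


if __name__ == "__main__":
    # Made-up pLDDT-like scores for four structures, for illustration only.
    example = [80.0, 85.0, 90.0, 95.0]
    median, half_sd = bar_and_error(example)
    print(f"median={median}, half_sd={half_sd:.4f}")
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population estimator instead. With only four structures, as in Figure 2, the two differ noticeably.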