Model-based reinforcement learning for protein backbone design

Frederic Renard; Cyprien Courtot; Alfredo Reichlin; Oliver Bent

TL;DR

The paper tackles inverse design of protein backbones that meet predefined icosahedral shapes and structural-score thresholds. It applies AlphaZero, a model-based reinforcement learning approach built on Monte Carlo tree search (MCTS), to sequentially assemble backbones from helices and loops. It compares a sigmoid reward with a novel threshold-based reward and introduces side-objectives to regularize learning. The key contributions are: (i) showing that AlphaZero outperforms the prior MCTS baseline, (ii) showing that the threshold reward yields better learning than the sigmoid formulation, and (iii) demonstrating that adding side-objectives further improves multi-objective design quality. This work paves the way for scalable, traceable multi-objective protein backbone design and suggests extensions to other shapes, to sequence design, and to structure validation with predictive tools such as AlphaFold.

Abstract

Designing protein nanomaterials of predefined shape and characteristics has the potential to dramatically impact the medical industry. Machine learning (ML) has proven successful in protein design, reducing the need for expensive rounds of wet-lab experiments. However, challenges persist in efficiently exploring protein fitness landscapes to identify optimal protein designs. In response, we propose the use of AlphaZero to generate protein backbones that meet shape and structural scoring requirements. We extend an existing Monte Carlo tree search (MCTS) framework by incorporating a novel threshold-based reward and secondary objectives to improve design precision. This extension considerably outperforms existing approaches, yielding protein backbones that better respect structural scores. The application of AlphaZero is novel in the context of protein backbone design and demonstrates promising performance: AlphaZero consistently surpasses baseline MCTS by more than 100% in top-down protein design tasks. Additionally, our application of AlphaZero with secondary objectives yields further promising results, indicating the potential of model-based reinforcement learning (RL) for navigating the intricate and nuanced aspects of protein design.
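The TL;DR contrasts a sigmoid reward with a threshold-based reward, but neither formula appears in this section. The Python sketch below shows one generic way such rewards can differ; it is not the authors' formulation. The metric names, threshold values, the `steepness` parameter, and the assumption that higher is better for every metric are all hypothetical choices made for illustration.

```python
import math


def sigmoid_reward(scores, thresholds, steepness=10.0):
    """Smooth reward (hypothetical form): mean of sigmoid(k * (score - threshold)).

    Every metric contributes partial credit, even when it is below its threshold.
    """
    terms = [
        1.0 / (1.0 + math.exp(-steepness * (scores[m] - thresholds[m])))
        for m in thresholds
    ]
    return sum(terms) / len(terms)


def threshold_reward(scores, thresholds):
    """Step reward (hypothetical form): fraction of metrics that meet their threshold."""
    passed = sum(1 for m in thresholds if scores[m] >= thresholds[m])
    return passed / len(thresholds)


# Hypothetical metric values and thresholds; higher is assumed to be better for all of them.
scores = {"core_designability": 0.8, "interface_designability": 0.4, "shape_fit": 0.95}
thresholds = {"core_designability": 0.7, "interface_designability": 0.5, "shape_fit": 0.9}

print("sigmoid reward:", round(sigmoid_reward(scores, thresholds), 3))
print("threshold reward:", round(threshold_reward(scores, thresholds), 3))
```

In this toy form, `threshold_reward` only changes when a metric crosses its threshold, while `sigmoid_reward` also moves with progress below a threshold. The TL;DR reports that the authors' own threshold formulation learned better than their sigmoid one.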

Paper Structure (24 sections, 4 equations, 9 figures, 4 tables)

Introduction
Methods
Markov Decision Process
State and action spaces
Reward
Episodes
AlphaZero for protein backbone design
AlphaZero algorithm
AlphaZero algorithm with side-objectives
Implementation
Results
Benchmark of MCTS against AlphaZero
Motivation and design
Results
Benchmark of AlphaZero with and without side-objectives
...and 9 more sections

Figures (9)

Figure 1: Diagram of the AlphaZero action selection process. Starting from the root node, the tree of states and actions is expanded by repeating the select, expand-and-evaluate, and backup phases. First, a new child node is selected by choosing $a^* = \arg\max_a \left[ Q(s, a) + c_{puct} P(s,a) \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right]$, with $P(s,a)$ the policy network output, $Q(s,a)$ the mean action value of $(s,a)$ and $N(s,a)$ the visit count of $(s,a)$. In the second phase, this new child node is evaluated by the neural network $f_\theta (s) = (P(s,a), V(s))$, with $V(s)$ the value network output. In the third phase, the value estimate $V(s)$ is used to update the $Q$ values of the ancestor nodes along the search path. After a number of simulations, an action is selected according to $\boldsymbol{\pi}(a|s) = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}$. Once a terminal state is reached, $(\boldsymbol{\pi}, r_T, s_T)$ are stored in a buffer.
Figure 2: Means of the protein score distributions with 95% bootstrap confidence intervals. AlphaZero, and more specifically AlphaZero (thresholds), systematically outperforms on all scores.
Figure 3: of the reward of both algorithms at the first epoch and at epoch 40 of training. AlphaZero (side-objectives) consistently achieves higher rewards than AlphaZero (original).
Figure 4: Rewards of the AlphaZero agents throughout training. The maximum and mean rewards of AlphaZero (side-objectives) are both consistently higher than those of AlphaZero (original).
Figure 5: Mixture of experts architecture for AlphaZero. Circles represent linear layers. are used to predict the core and interface designability scores and for the other scores. The hidden states used to compute the scores are concatenated and fed to two separate heads: a policy head and a value head. The neural network outputs the policy, the value and the five different protein structure scores.
...and 4 more figures
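The Figure 1 caption gives the formulas for the select and backup phases and for the final visit-count policy. A minimal Python sketch of those three formulas follows, assuming a single-agent setting (values are backed up without the sign flip used in two-player games), a dictionary-based node, and $Q = 0$ for unvisited actions. The `Node` class, the function names, and the toy action labels `"helix"` and `"loop"` are hypothetical; the network evaluation and tree expansion are stubbed out with made-up leaf values.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Edge statistics for one state s: prior P(s,a), visit count N(s,a), total value W(s,a)."""
    prior: dict                                    # action -> P(s,a) from the policy network
    visits: dict = field(default_factory=dict)     # action -> N(s,a)
    value_sum: dict = field(default_factory=dict)  # action -> W(s,a)

    def q(self, a):
        """Mean action value Q(s,a) = W(s,a) / N(s,a); 0 for unvisited actions (an assumption)."""
        n = self.visits.get(a, 0)
        return self.value_sum.get(a, 0.0) / n if n else 0.0


def select_action(node, c_puct=1.0):
    """Select phase: argmax_a Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    sqrt_total = math.sqrt(sum(node.visits.get(b, 0) for b in node.prior))

    def puct(a):
        return node.q(a) + c_puct * node.prior[a] * sqrt_total / (1 + node.visits.get(a, 0))

    return max(node.prior, key=puct)


def backup(path, value):
    """Backup phase: add the leaf estimate V(s) to every (node, action) edge on the path.

    Single-agent setting, so the value is not negated between tree levels.
    """
    for node, a in path:
        node.visits[a] = node.visits.get(a, 0) + 1
        node.value_sum[a] = node.value_sum.get(a, 0.0) + value


def visit_policy(node, tau=1.0):
    """pi(a|s) = N(s,a)^(1/tau) / sum_b N(s,b)^(1/tau); call after at least one simulation."""
    weights = {a: node.visits.get(a, 0) ** (1.0 / tau) for a in node.prior}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Toy usage: children are treated as immediate leaves with made-up value estimates,
# standing in for the network evaluation V(s) of the expand-and-evaluate phase.
root = Node(prior={"helix": 0.6, "loop": 0.4})
for leaf_value in [0.2, 0.9, 0.5]:
    a = select_action(root, c_puct=1.5)
    backup([(root, a)], leaf_value)
print(root.visits, visit_policy(root, tau=1.0))
```

With `tau = 1` the policy is simply the normalized visit counts; smaller values of `tau` sharpen it toward the most-visited action.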
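The Figure 5 caption describes a mixture-of-experts network: one group of layers predicts the core and interface designability scores, another predicts the remaining scores, and their hidden states are concatenated and passed to a policy head and a value head. The pure-Python sketch below mirrors only that data flow. The single linear layer per expert, the ReLU activation, the layer sizes, the random untrained weights, the 2 + 3 split of the five scores (inferred from the caption), and the sigmoid on the value (which assumes rewards in [0, 1]) are all assumptions, not details from the paper.

```python
import math
import random


def linear(n_in, n_out, rng):
    """Random, untrained weight matrix W[n_out][n_in] and zero bias b[n_out]."""
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def dense(layer, x, relu=True):
    """Compute W x + b, optionally followed by a ReLU."""
    w, b = layer
    y = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [max(0.0, v) for v in y] if relu else y


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class MixtureOfExpertsNet:
    """Hypothetical two-expert network with policy, value, and five score outputs."""

    def __init__(self, state_dim, hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.expert_a = linear(state_dim, hidden, rng)  # expert for the designability scores
        self.scores_a = linear(hidden, 2, rng)          # core and interface designability
        self.expert_b = linear(state_dim, hidden, rng)  # expert for the other scores
        self.scores_b = linear(hidden, 3, rng)          # the three remaining scores
        self.policy_head = linear(2 * hidden, n_actions, rng)
        self.value_head = linear(2 * hidden, 1, rng)

    def __call__(self, state):
        h_a = dense(self.expert_a, state)
        h_b = dense(self.expert_b, state)
        scores = dense(self.scores_a, h_a, relu=False) + dense(self.scores_b, h_b, relu=False)
        h = h_a + h_b  # list concatenation of the two experts' hidden states
        policy = softmax(dense(self.policy_head, h, relu=False))
        value = 1.0 / (1.0 + math.exp(-dense(self.value_head, h, relu=False)[0]))
        return policy, value, scores


net = MixtureOfExpertsNet(state_dim=8, hidden=16, n_actions=4)
policy, value, scores = net([0.1] * 8)
print(len(policy), round(value, 3), len(scores))
```

Because the policy and value heads read the same hidden states that predict the structure scores, the score outputs can act as auxiliary training targets; this is one plausible reading of how side-objectives regularize learning, not a claim about the paper's exact training setup.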