Enhancing Reinforcement learning in 3-Dimensional Hydrophobic-Polar Protein Folding Model with Attention-based layers

Peizheng Liu, Hitoshi Iba

TL;DR

This paper tackles 3D hydrophobic–polar protein folding by casting folding as a reinforcement learning task within a self-avoiding walk environment and solving it with a Transformer-based DQN enhanced by dueling and double Q-learning, prioritized replay, and symmetry-breaking constraints. The approach represents states as sequences processed by Transformer encoders, yielding a global state representation via a CLS token and leveraging positional encoding to capture residue order. Empirical results on standard benchmarks show competitive performance, achieving best-known values for several sequences and near-optimal results for longer chains, while also revealing sensitivity to hyperparameters and exploration strategies. The work demonstrates the potential of attention-based reinforcement learning for lattice HP folding and highlights clear avenues for future improvements, including systematic ablations, advanced exploration methods, and extensions to 2D/3D folding problems.

Abstract

Transformer-based architectures have recently propelled advances in sequence modeling across domains, but their application to the hydrophobic-hydrophilic (H-P) model for protein folding remains relatively unexplored. In this work, we adapt a Deep Q-Network (DQN) integrated with attention mechanisms (Transformers) to address the 3D H-P protein folding problem. Our system formulates folding decisions as a self-avoiding walk in a reinforcement learning environment and employs a specialized reward function based on favorable hydrophobic interactions. To improve performance, the method incorporates validity checks, including symmetry-breaking constraints, together with dueling and double Q-learning and prioritized replay to focus learning on critical transitions. Experimental evaluations on standard benchmark sequences demonstrate that our approach achieves several best-known solutions for shorter sequences and obtains near-optimal results for longer chains. This study underscores the promise of attention-based reinforcement learning for protein folding and provides a prototype of a Transformer-based Q-network structure for 3-dimensional lattice models.

Paper Structure

This paper contains 24 sections, 20 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overall Training Data Flow. The process begins with an empty folding environment, from which the agent obtains the initial state $s_0$. The agent iteratively follows a state-action-reward cycle until the episode terminates. After termination, the reward is backpropagated through each step and the environment is reset to an empty state for the next episode. The policy network is trained on a batch of samples from the replay memory after each step. The target network is periodically updated by copying the parameters of the policy network.
  • Figure 2: Three steps for canceling symmetrical structures. Step 1: Eliminate the first action, canceling 6 central symmetrical structures around (0,0,0). Step 2: Eliminate the first non-forward action, canceling 4 rotational symmetrical structures around the x-axis. Step 3: Eliminate the first up-or-down action, canceling 2 mirror-image structures around the xy-plane.
  • Figure 3: Invalid move scenarios for a sequence of length 13 on a 2D lattice. In scenario (a), the depicted move exceeds the out-of-bounds limit of $\frac{length}{2} = 6.5$. In scenario (b), an overlap occurs at the amino acid positions $(0,0)$ or $(1,1)$ for the two depicted moves. In scenario (c), no further move can be made after the depicted move without causing a collision. In scenario (d), the depicted move is currently valid and not trapped, but it prevents the successful placement of all 13 amino acids in subsequent steps.
  • Figure 4: Training Curves for Sequences 1 and 7. The reward is the absolute value of the energy, so it is always non-negative. In (a), the evaluation for Sequence 1 converged to the optimal value of -11 but subsequently degraded after convergence. In (b), the evaluation metric for Sequence 7 failed to stabilize at any specific value, and no evaluated model attained the optimal value of -49 identified during training.
  • Figure 5: Optimal structures identified for Sequence 1. All structures were found during simulation episodes. Structure (b) diverges from the others at the 9th action, while (d) differs from (a) and (c) beginning at the 14th action. Further differences occur at the 15th action between (c) and (d).
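
The validity scenarios (a) to (c) in Figure 3 and the reward convention in Figure 4 can be sketched in a few lines of Python for the 3D cubic lattice. This is a minimal illustration and not the authors' implementation: the function names, the coordinate-wise form of the out-of-bounds check, and the energy convention of -1 per non-consecutive H-H lattice contact (the usual HP-model definition) are assumptions made here. Scenario (d), which needs look-ahead over future placements, is not covered.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

# The six unit moves on a cubic lattice; even indices are the positive directions.
MOVES: List[Coord] = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                      (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def add(a: Coord, b: Coord) -> Coord:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def is_valid_move(path: List[Coord], new_pos: Coord, seq_len: int) -> bool:
    """Reject out-of-bounds moves (scenario a) and overlaps (scenario b).

    Assumption: the bound is applied per coordinate as |c| <= seq_len / 2.
    """
    limit = seq_len / 2
    if any(abs(c) > limit for c in new_pos):
        return False
    return new_pos not in path


def is_trapped(path: List[Coord], seq_len: int) -> bool:
    """Scenario c: residues remain to be placed but no neighbor is free."""
    if len(path) >= seq_len:
        return False
    head = path[-1]
    return not any(is_valid_move(path, add(head, m), seq_len) for m in MOVES)


def hp_energy(sequence: str, path: List[Coord]) -> int:
    """Assumed HP energy: -1 per pair of H residues that are lattice
    neighbors but not consecutive along the chain."""
    index = {pos: i for i, pos in enumerate(path)}
    contacts = 0
    for i, pos in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only in the positive directions so each pair is counted once.
        for m in MOVES[::2]:
            j = index.get(add(pos, m))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


def reward(sequence: str, path: List[Coord]) -> int:
    """Reward as the absolute value of the energy (Figure 4 convention)."""
    return abs(hp_energy(sequence, path))
```

For example, folding the four-residue chain `HHHH` into a unit square places the first and last residues next to each other, which gives one contact, an energy of -1, and a reward of 1 under these assumptions.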