ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu

TL;DR

The results indicate that efficient online RL fine-tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space.

Abstract

Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence-structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self-improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self-derived ddG predictor, providing stable multi-objective signals while avoiding the prohibitive cost of physics-based methods. To ensure robustness in online RL, we further introduce a novel embedding-level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation balancing multi-reward optimization, KL-divergence from a reference model, and diversity regularization, ProteinZero achieves robust improvements across designability, stability, recovery, and diversity. On the CATH-4.3 benchmark, it consistently outperforms state-of-the-art baselines including ProteinMPNN, ESM-IF, and InstructPLM, reducing design failure rates by 36-48% and achieving success rates above 90% across diverse folds. Importantly, a complete RL run can be executed on a single 8×GPU node within three days, including reward computation and data generation. These results indicate that efficient online RL fine-tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space. Full source code and model checkpoints will be released upon publication.

Paper Structure

This paper contains 55 sections, 5 theorems, 17 equations, 8 figures, 11 tables.

Key Result

Lemma F.4

Under Assumption \ref{as:ac}, with $Z=\psi_\theta(X,Y)$ on the unit sphere, the diversity term is the squared norm of the mean embedding: $\mathbb{E}_{y,y'\sim p}[c(y,y')]=\|\mathbb{E}_{y\sim p}[Z]\|_2^2$. Consequently, the objective is concave in $p$. It is strictly concave on the relative interior if $\alpha_{\mathrm{KL}}>0$.
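
The identity in Lemma F.4 can be checked numerically. The sketch below is illustrative only and is not the authors' code: it assumes that $c(y,y')$ is the inner product (cosine similarity) of unit-norm embeddings and that $y$ and $y'$ are drawn independently from $p$. The random embeddings, the candidate count, and the function names are hypothetical stand-ins for $\psi_\theta$ and the model's sampling distribution.

```python
import numpy as np


def expected_pairwise_similarity(p, z):
    """E_{y,y'~p}[<z_y, z_y'>] for independent draws y, y' from p."""
    gram = z @ z.T  # all pairwise inner products, including self-pairs
    return float(p @ gram @ p)


def squared_mean_norm(p, z):
    """||E_{y~p}[z_y]||_2^2, the squared norm of the mean embedding."""
    mean_embedding = p @ z
    return float(mean_embedding @ mean_embedding)


rng = np.random.default_rng(0)

# Hypothetical embeddings: 64 candidate sequences, 16 dimensions, unit norm.
z = rng.normal(size=(64, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)

# Two hypothetical sampling distributions over the candidates.
p = rng.dirichlet(np.ones(64))
q = rng.dirichlet(np.ones(64))

# Both sides of the identity agree up to floating-point error.
assert np.isclose(expected_pairwise_similarity(p, z), squared_mean_norm(p, z))

# The term is convex in p, so subtracting it from an objective keeps that
# objective concave in p (assuming it enters with a negative sign).
lam = 0.3
mix = lam * p + (1 - lam) * q
assert squared_mean_norm(mix, z) <= (
    lam * squared_mean_norm(p, z) + (1 - lam) * squared_mean_norm(q, z) + 1e-12
)
```

Under these assumptions, a distribution that collapses onto a single sequence attains the maximum value of 1 for this term, which is one way to read the lemma's description of diversity as a penalty on the embedding mean.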

Figures (8)

  • Figure 1: ProteinZero framework. Upper: Online RL components: ESMFold-based designability (TM-score via US-Align), $\Delta\Delta G$ predictor using backbone-conditioned likelihoods, and embedding diversity regularization. Lower: Iterative training where inverse folding models generate sequences, receive multi-objective rewards, and update with KL constraints and diversity regularization. Held-out CATH-4.3 evaluation demonstrates substantial improvements across all key design metrics.
  • Figure 2: Performance comparison across seven evaluation metrics (Recovery Rate, Stability, TM Score, pLDDT, Diversity, scRMSD < 2 Å (%), and Success Rate) for 0-150 residue proteins (left) and 150-300 residue proteins (right). ProteinZero variants achieve the highest values across all metrics.
  • Figure 3: Representative cases of protein structure designs from held-out test set. Visual comparison between ProteinZero (cyan), native proteins (pink), and InstructPLM (lime green). Top panels show selected cases where naturally unstable proteins are redesigned by ProteinZero. In these examples, predicted stability improvements range from 233% to 858% (based on FoldX ddG calculations) while maintaining structural similarity (TM-scores > 0.95). Bottom panels present comparative examples with InstructPLM for challenging $\beta$-rich structures and complex architectures. In the shown cases, ProteinZero generates designs with negative predicted ddG values while InstructPLM produces positive values indicating predicted instability. These visualizations represent individual design outcomes; comprehensive quantitative results are provided in Table \ref{tab:protein_design}.
  • Figure 4: Fast-ddG predictor performance on the Ssym dataset with 342 wet-lab validated single-point mutations (wild-type $\rightarrow$ mutant). Each subfigure shows predicted versus experimental $\Delta\Delta G$ values for different model variants: (a) pretrained model before RL fine-tuning, (b) fine-tuned with joint TM-score + Fast-ddG rewards, (c) fine-tuned with Fast-ddG reward only. All variants achieve comparable correlation with experimental measurements (PCC $\approx$ 0.60–0.62, RMSE $\approx$ 1.44–1.47 kcal/mol).
  • Figure 5: Integration of ProteinZero within the AI-driven protein design pipeline. Pre-trained generative models evolve through ProteinZero's online reinforcement learning framework to produce optimized protein sequences. These AI-designed candidates proceed to laboratory synthesis and experimental characterization, enabling applications in diverse biotechnological domains such as enzyme engineering and therapeutic development. The computational stages (blue) can leverage GPU parallelization for efficient large-scale processing.
  • ...and 3 more figures
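
The Figure 4 caption reports agreement between predicted and experimental $\Delta\Delta G$ as a Pearson correlation coefficient (PCC) and a root-mean-square error (RMSE). The sketch below shows how these two metrics are conventionally computed. It is not the paper's evaluation code, the function names are hypothetical, and the numbers in the usage example are made-up toy values rather than Ssym data.

```python
import numpy as np


def pearson_cc(predicted, experimental):
    """Pearson correlation coefficient between two 1-D arrays."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    return float(np.corrcoef(predicted, experimental)[0, 1])


def rmse(predicted, experimental):
    """Root-mean-square error, in the units of the inputs (here kcal/mol)."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    return float(np.sqrt(np.mean((predicted - experimental) ** 2)))


# Toy values for illustration only (not taken from the paper or from Ssym).
toy_predicted = [0.5, -1.2, 2.0, 0.1]
toy_experimental = [0.8, -0.9, 1.5, -0.4]

print(f"PCC  = {pearson_cc(toy_predicted, toy_experimental):.2f}")
print(f"RMSE = {rmse(toy_predicted, toy_experimental):.2f} kcal/mol")
```

PCC measures how well the predictor ranks mutations by linear trend, while RMSE measures absolute error in kcal/mol, so the two are complementary: a predictor can correlate well while being systematically offset.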

Theorems & Definitions (10)

  • Remark F.2: Setting and scope of analysis
  • Remark F.3
  • Lemma F.4: Diversity as a penalty on the embedding mean
  • Proposition F.5: Interior fixed point with a non-local repulsive potential
  • Theorem F.6: KL barrier to deterministic collapse
  • proof
  • Proposition F.7: No-KL case: finite condition that rules out a delta optimum
  • Corollary F.8: Readable sufficient condition
  • Remark F.9: Scope of the diversity term
  • Remark F.10: Mini-batch estimator