Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute

Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis

Abstract

Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.

Paper Structure

This paper contains 66 sections, 11 equations, 32 figures, 14 tables, 4 algorithms.

Figures (32)

  • Figure 1: (Top) Proteina-Complexa's target-conditioned generation process. (Bottom) Scaling test-time compute, we use Complexa's generative prior for more efficient optimization than previous hallucination methods (\ref{sec:inference_time_opt}). We depict beam search, which steers stochastic generation toward high-quality binders, guided by structure prediction models' interface scores or hydrogen bond energies. Intermediate candidate states (blue) are scored via rollouts (blue, dotted), promising candidates are kept, and new trajectories are launched (orange).
  • Figure 2: Binders generated by Complexa, passing in-silico success criteria (more visualizations in \ref{app:more_visualizations}). (a) TNF-$\alpha$ three-chain target. (b) Claudin-1 target, with interface hydrogen bonds shown in red. (c) OQO small molecule target.
  • Figure 3: Teddymer dimers resemble realistic binder-target structures, including interface hydrogen bonding (zoom-in). Also see \ref{app:teddymer}.
  • Figure 4: Filtered training datasets used by Complexa.
  • Figure 5: Complexa's latent target conditioning. When training the conditional denoiser, the encoder and decoder are frozen.
  • ...and 27 more figures
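The beam-search loop sketched in Figure 1 can be written generically: a pool of partial generation states is expanded, each child is scored via a rollout (the paper uses structure prediction interface scores or hydrogen bond energies), and only the most promising candidates survive to seed the next round. Below is a minimal, illustrative sketch of that pattern; `propose` and `score` are hypothetical stand-ins, not the paper's actual denoiser or scoring model.

```python
def beam_search(init_candidates, propose, score, beam_width=4, steps=3):
    """Generic beam search over stochastic generation trajectories.

    propose(state) -> list of child states (stand-in for launching new
                      partial-denoising trajectories from a candidate)
    score(state)   -> float, higher is better (stand-in for a rollout
                      scored by an interface metric)
    """
    beam = list(init_candidates)
    for _ in range(steps):
        # Expand every surviving candidate into new trajectories.
        children = [c for state in beam for c in propose(state)]
        # Keep only the most promising candidates for the next round.
        children.sort(key=score, reverse=True)
        beam = children[:beam_width]
    return max(beam, key=score)
```

With toy states (integers, where `propose` branches into two children and `score` is the identity), the loop greedily tracks the highest-scoring trajectory while bounding compute by the beam width.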