Seek and You Shall Fold

Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein

TL;DR

The paper tackles the challenge of conditioning protein structure generation on experimental data when the predictors of those data are non-differentiable. It introduces a non-differentiable guidance framework that couples a diffusion-based generator (BioEmu) with a tailored genetic algorithm to optimize a black-box score $S$ over generated structures $x = G(z_T, 0)$. Across nuclear Overhauser effect (NOE) restraints, pairwise distances, and chemical shifts, the approach improves agreement with experimental data. It recovers folds consistent with NOE-derived restraints and provides a proof of concept for chemical-shift guidance, though the latter does not yet yield fully correct ensembles because of predictor limitations. The work demonstrates a general strategy for integrating diverse, non-differentiable experimental signals into data-conditioned protein modeling, moving beyond the limits of differentiable guidance.

Abstract

Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to integrate directly into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and, for the first time, chemical shifts. These results establish chemical-shift-guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
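The core idea, evolving generator latents against a score that can only be evaluated and never differentiated, can be illustrated with a minimal sketch. This is a generic elitist genetic algorithm, not the tailored variant from the paper, whose details are not given in this summary. `toy_decoder` is a hypothetical stand-in for the generator $G(z_T, 0)$ (it is not BioEmu and does not use any BioEmu API), `toy_score` is a hypothetical black-box score $S$ where lower is better, and all hyperparameters are illustrative assumptions.

```python
import random


def toy_decoder(z):
    """Hypothetical stand-in for the generator G(z_T, 0); NOT BioEmu."""
    return [2.0 * v for v in z]


def toy_score(x, target):
    """Hypothetical black-box score S (lower is better); only evaluated, never differentiated."""
    return sum((a - b) ** 2 for a, b in zip(x, target)) / len(x)


def evolve(decode, score, dim, pop_size=32, n_elite=8, sigma=0.3, cycles=50, seed=0):
    """Generic elitist GA over latent vectors (assumed operators, not the paper's)."""
    rng = random.Random(seed)
    # Initial population: latents drawn from a standard normal prior.
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    history = []
    for _ in range(cycles):
        # Decode every latent and rank by the black-box score.
        ranked = sorted(population, key=lambda z: score(decode(z)))
        elites = ranked[:n_elite]
        history.append(score(decode(elites[0])))
        children = []
        while len(children) < pop_size - n_elite:
            pa, pb = rng.sample(elites, 2)
            # Uniform crossover followed by Gaussian mutation of the latent.
            child = [a if rng.random() < 0.5 else b for a, b in zip(pa, pb)]
            child = [v + rng.gauss(0.0, sigma) for v in child]
            children.append(child)
        # Elites survive unchanged, so the best score never gets worse.
        population = elites + children
    best = min(population, key=lambda z: score(decode(z)))
    return best, history


target = [1.0, -2.0, 0.5, 3.0]
best, history = evolve(toy_decoder, lambda x: toy_score(x, target), dim=4)
```

Because the score is only called on decoded outputs, any predictor could in principle be swapped in for `toy_score` without requiring gradients, which is the property the framework relies on.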

Paper Structure

This paper contains 44 sections, 14 equations, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: Overview of method. Latents are perturbed and decoded with BioEmu, scored by experimental data, and evolved with a genetic algorithm.
  • Figure 2: Pairwise distance guidance of 4OLE. Alternative conformations of residues 60–68 were guided from conformation B (blue) toward conformation A (red) using $C_\alpha$ pairwise distance restraints. Without guidance, BioEmu produces structures resembling conformation B, whereas guidance enables recovery of the helical conformation A.
  • Figure 3: NOE-guided structures. Guided BioEmu with NOE-derived restraints on peptides 1DEC and 2LI3. Reference PDBs are in gray. AF3 priors contained misfolded regions inconsistent with NOEs, unguided BioEmu did not match the PDB, while guided BioEmu reduced violations and recovered the experimental folds.
  • Figure 4: Chemical shift guidance metric. BioEmu guided using UCBShift-predicted chemical shifts. Plots show mean absolute error (MAE) of the best structure as a function of optimization cycles. Left: peptide 1DEC with synthetic PDB-derived shifts. Right: protein 1DFU with experimental shifts (BMRB 4395). In both cases, guidance improved the chemical-shift metric, though convergence to the experimental ensemble was not achieved.
  • Figure 5: Best vs. mean metric per cycle. For each target, we plot the best-structure metric (solid) and the mean-structure metric (faint) using the same definitions as in the main text for NOE guidance (Y-axis: fraction of violated restraints $\times$ mean violation distance; X-axis: number of cycles). Dashed baselines indicate PDB and AF3.
  • ...and 9 more figures
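The NOE metric named in the Figure 5 caption (fraction of violated restraints $\times$ mean violation distance) can be sketched as follows. The caption does not spell out the details, so this sketch assumes that each restraint is an upper distance bound, that a restraint is violated when the distance exceeds that bound, and that the mean is taken over violated restraints only. The function name and the data layout are hypothetical.

```python
import math


def noe_violation_metric(coords, restraints):
    """Fraction of violated restraints times mean violation distance.

    coords:     dict mapping an atom id to an (x, y, z) tuple.
    restraints: list of (atom_i, atom_j, upper_bound) tuples.
    Assumption: upper bounds only; mean is over violated restraints.
    """
    violations = []
    for atom_i, atom_j, upper in restraints:
        distance = math.dist(coords[atom_i], coords[atom_j])
        if distance > upper:
            violations.append(distance - upper)
    if not violations:
        return 0.0  # nothing violated (or no restraints given)
    fraction_violated = len(violations) / len(restraints)
    mean_violation = sum(violations) / len(violations)
    return fraction_violated * mean_violation


coords = {"a": (0.0, 0.0, 0.0), "b": (3.0, 4.0, 0.0), "c": (1.0, 0.0, 0.0)}
restraints = [("a", "b", 4.0), ("a", "c", 2.0)]
metric = noe_violation_metric(coords, restraints)
```

In this toy case one of two restraints is violated, by 1.0 distance unit, so the metric is 0.5. A value of 0.0 means every restraint is satisfied.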