Seek and You Shall Fold
Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein
TL;DR
The paper tackles the challenge of conditioning protein structure generation on experimental data when the predictors of experimental observables are non-differentiable. It introduces a non-differentiable guidance framework that couples a diffusion-based generator (BioEmu) with a tailored genetic algorithm to optimize a black-box score $S$ over generated structures $x = G(z_T, 0)$. Across NOE restraints, pairwise distances, and chemical shifts, the approach improves agreement with experimental data. It recovers folds consistent with NOE-derived restraints and provides a proof of concept for chemical-shift guidance, although the latter does not yet yield fully correct ensembles because of limitations in current predictors. The work demonstrates a general strategy for integrating diverse, non-differentiable experimental signals into data-conditioned protein modeling, moving beyond the limits of differentiable guidance.
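The idea of searching over generator inputs with a black-box score can be illustrated with a minimal sketch. This is not the authors' tailored genetic algorithm: it is a generic elitist genetic search, and it assumes that $z_T$ is the generator's initial latent. The names `toy_generator`, `black_box_score`, and `genetic_search`, along with all hyperparameters, are hypothetical; `toy_generator` is a toy stand-in for $G(z_T, 0)$ and does not call BioEmu.

```python
import numpy as np


def toy_generator(z):
    """Hypothetical stand-in for G(z_T, 0): latent vector -> (n, 3) coordinates."""
    return np.cumsum(z.reshape(-1, 3), axis=0)


def black_box_score(x, restraints):
    """Non-differentiable score S: number of violated upper-bound distance restraints."""
    violations = 0
    for i, j, upper in restraints:
        if np.linalg.norm(x[i] - x[j]) > upper:
            violations += 1
    return violations


def genetic_search(generator, score_fn, dim, pop_size=32, n_generations=50,
                   n_elite=8, sigma=0.3, seed=0):
    """Minimize score_fn(generator(z)) over latents z without using gradients."""
    rng = np.random.default_rng(seed)
    population = rng.standard_normal((pop_size, dim))
    history = []  # best score found in each generation

    for _ in range(n_generations):
        scores = np.array([score_fn(generator(z)) for z in population])
        order = np.argsort(scores)
        history.append(scores[order[0]])

        # Elitism: the best latents survive unchanged, so the best score never worsens.
        elite = population[order[:n_elite]]

        # Uniform crossover between random elite parents, then Gaussian mutation.
        n_children = pop_size - n_elite
        parents_a = elite[rng.integers(n_elite, size=n_children)]
        parents_b = elite[rng.integers(n_elite, size=n_children)]
        mask = rng.random(parents_a.shape) < 0.5
        children = np.where(mask, parents_a, parents_b)
        children = children + sigma * rng.standard_normal(children.shape)

        population = np.vstack([elite, children])

    # Evaluate the final population and return its best member.
    scores = np.array([score_fn(generator(z)) for z in population])
    best = int(np.argmin(scores))
    return population[best], scores[best], history


if __name__ == "__main__":
    # Toy restraints (i, j, upper bound) on a 10-point chain; values are illustrative only.
    restraints = [(0, 5, 1.5), (2, 7, 1.5), (1, 9, 2.0)]
    z, s, hist = genetic_search(
        toy_generator, lambda x: black_box_score(x, restraints), dim=30
    )
    print("violations at start:", hist[0], "violations at end:", s)
```

Because the score is only ever evaluated, never differentiated, any predictor of experimental observables could in principle be substituted for `black_box_score`, at the cost of one generator call and one predictor call per candidate per generation.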
Abstract
Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance (NMR), where rich data such as chemical shifts are hard to integrate directly into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect (NOE) restraints, and, for the first time, chemical shifts. These results establish chemical-shift-guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
