Active learning for affinity prediction of antibodies
Alexandra Gessner, Sebastian W. Ober, Owen Vickery, Dino Oglić, Talip Uçar
TL;DR
The paper presents an active-learning pipeline that couples relative binding free energy (RBFE) simulations with Bayesian optimization to efficiently identify antibody mutations that enhance binding, addressing both the combinatorial mutation space and the cost of physics-based evaluations. It systematically evaluates multiple sequence encodings and Gaussian-process kernels, validating on two precomputed RBFE datasets and then executing a full loop with a larger Schrödinger Res Scan dataset. Results show that encoding choice and kernel type strongly influence sample efficiency, with AbLang2 and the Tanimoto kernel performing well in validation, though the full-loop experiments reveal exploration challenges and suggest that an unrestricted search is beneficial. The work provides a practical framework for accelerating antibody lead optimization and points to future enhancements such as multi-source BO across simulators and integrating structural antibody information to improve scalability and robustness.
Abstract
The primary objective of most lead optimization campaigns is to enhance the binding affinity of ligands. For large molecules such as antibodies, identifying mutations that enhance antibody affinity is particularly challenging due to the combinatorial explosion of potential mutations. When the structure of the antibody-antigen complex is available, relative binding free energy (RBFE) methods can offer valuable insights into how different mutations will impact the potency and selectivity of a drug candidate, thereby reducing the reliance on costly and time-consuming wet-lab experiments. However, accurately simulating the physics of large molecules is computationally intensive. We present an active learning framework that iteratively proposes promising sequences for simulators to evaluate, thereby accelerating the search for improved binders. We explore different modeling approaches to identify the most effective surrogate model for this task, and evaluate our framework both using pre-computed pools of data and in a realistic full-loop setting.
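To make the loop described above concrete, here is a minimal pool-based sketch of active learning with a Gaussian-process surrogate. It is an illustration under stated assumptions, not the authors' implementation: sequences are assumed to be already encoded as binary feature vectors, the surrogate is a zero-mean GP with a Tanimoto kernel, the acquisition rule is a lower confidence bound (lower ΔΔG is taken to mean stronger binding), and `oracle` is a hypothetical stand-in for an RBFE simulator call. All function and parameter names are invented for this example.

```python
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of two nonnegative feature matrices."""
    dot = A @ B.T
    a2 = (A * A).sum(axis=1)[:, None]
    b2 = (B * B).sum(axis=1)[None, :]
    denom = a2 + b2 - dot
    safe = np.where(denom > 0, denom, 1.0)
    # Convention (assumption): two all-zero vectors have similarity 1.
    return np.where(denom > 0, dot / safe, 1.0)


def gp_posterior(X_train, y_train, X_query, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at X_query."""
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = tanimoto_kernel(X_train, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # The Tanimoto kernel has unit prior variance: k(x, x) = 1.
    var = np.clip(1.0 - (v * v).sum(axis=0), 1e-12, None)
    return mean, np.sqrt(var)


def active_learning_loop(X_pool, oracle, n_init=5, n_rounds=20, beta=2.0, seed=0):
    """Iteratively pick pool candidates to evaluate with the (hypothetical) oracle."""
    rng = np.random.default_rng(seed)
    labelled = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    y = [oracle(i) for i in labelled]
    for _ in range(n_rounds):
        candidates = np.setdiff1d(np.arange(len(X_pool)), labelled)
        y_arr = np.asarray(y, dtype=float)
        offset = y_arr.mean()  # centre the targets for the zero-mean GP
        mean, std = gp_posterior(X_pool[labelled], y_arr - offset, X_pool[candidates])
        # Lower confidence bound: favour low predicted ddG and high uncertainty.
        lcb = mean + offset - beta * std
        pick = int(candidates[np.argmin(lcb)])
        labelled.append(pick)
        y.append(oracle(pick))
    return labelled, np.asarray(y, dtype=float)
```

With a precomputed pool, `oracle` can simply look up a stored value, for example `lambda i: float(ddg[i])`. In a full-loop setting it would instead launch a simulation for the proposed sequence. The parameter `beta` controls the balance between exploiting low predicted values and exploring uncertain candidates.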
