Active learning for affinity prediction of antibodies

Alexandra Gessner; Sebastian W. Ober; Owen Vickery; Dino Oglić; Talip Uçar

TL;DR

The paper presents an active-learning pipeline that couples relative binding free energy (RBFE) simulations with Bayesian optimization to identify antibody mutations that enhance binding. It targets two obstacles: the combinatorial mutation space and the cost of physics-based evaluations. The authors systematically compare multiple sequence encodings and Gaussian-process kernels, first validating on two precomputed RBFE datasets and then running the full loop with the Schrödinger Res Scan simulator. Encoding choice and kernel type strongly influence sample efficiency, with AbLang2 and the Tanimoto kernel performing well in validation. The full-loop experiments, however, expose exploration challenges and indicate that an unrestricted search is beneficial. The work provides a practical framework for accelerating antibody lead optimization and points to future extensions such as multi-source Bayesian optimization across simulators and the integration of structural antibody information to improve scalability and robustness.

Abstract

The primary objective of most lead optimization campaigns is to enhance the binding affinity of ligands. For large molecules such as antibodies, identifying mutations that enhance antibody affinity is particularly challenging due to the combinatorial explosion of potential mutations. When the structure of the antibody-antigen complex is available, relative binding free energy (RBFE) methods can offer valuable insights into how different mutations will impact the potency and selectivity of a drug candidate, thereby reducing the reliance on costly and time-consuming wet-lab experiments. However, accurately simulating the physics of large molecules is computationally intensive. We present an active learning framework that iteratively proposes promising sequences for simulators to evaluate, thereby accelerating the search for improved binders. We explore different modeling approaches to identify the most effective surrogate model for this task, and evaluate our framework both using pre-computed pools of data and in a realistic full-loop setting.

Paper Structure (21 sections, 1 equation, 11 figures)

Introduction
Methods
Relative binding free energy methods
Bayesian optimization
Encoding antibody sequences
Gaussian processes on sequence embeddings
Active learning on sequence data
Data
Experiments
Validation
NQFEP data
Schrödinger Res Scan data
Full loop
Related work
Conclusion and Outlook
...and 6 more sections

Figures (11)

Figure 1: Schematic of the full loop. In each iteration, a new, previously unseen sequence is proposed by maximizing the acquisition function and fed into the simulator to obtain the corresponding $\Delta\Delta G$ value. Both sequence and simulator output are added to the dataset to update the surrogate model.
Figure 2: Schematic of the validation loop. We split the pre-computed dataset into a training set and a pool of held-out data. The loop iteratively selects a new sequence from the pool by maximizing the acquisition function on the held-out data and updates the surrogate model.
Figure 3: Validation on the NQFEP pre-computed dataset over 200 iterations, averaged over 10 runs. Best $\Delta\Delta G$ value found using the RBF (left), Matérn (center), and Tanimoto (right) kernels for all encodings. For the RBF and Matérn kernels, the embeddings have been projected to 5 dimensions. The horizontal dashed line is the best value in the dataset.
Figure 4: Validation on the Schrödinger Res Scan pre-computed dataset over 200 iterations, averaged over 10 runs. Best $\Delta\Delta G$ value found using the RBF (left), Matérn (center), and Tanimoto (right) kernels for all encodings. For the RBF and Matérn kernels, the embeddings have been projected to 5 dimensions. The horizontal dashed line is the best value in the dataset.
Figure 5: Results of the full loop run with the Schrödinger Res Scan simulator. We plot the best $\Delta\Delta G$ values found for each of three encodings using the Tanimoto kernel, as well as the best value from the pooled data from the validation experiments as a dashed line.
...and 6 more figures
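
The validation loop described in the caption of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic binary feature vectors rather than real antibody encodings, the acquisition function is assumed to be expected improvement (the captions do not name one), lower $\Delta\Delta G$ is assumed to be better, and the GP noise level is fixed instead of fitted. All function names (`tanimoto_kernel`, `gp_posterior`, `expected_improvement`, `run_validation_loop`, `make_toy_pool`) are hypothetical.

```python
import math
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of non-negative feature matrices."""
    dot = A @ B.T
    a2 = (A * A).sum(axis=1)[:, None]
    b2 = (B * B).sum(axis=1)[None, :]
    denom = a2 + b2 - dot
    safe = np.where(denom > 0, denom, 1.0)
    # Two all-zero vectors are treated as identical (similarity 1).
    return np.where(denom > 0, dot / safe, 1.0)


def gp_posterior(X_train, y_train, X_query, noise=1e-2):
    """GP posterior mean and std with a constant mean and fixed noise (assumed)."""
    mean_y = y_train.mean()
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = tanimoto_kernel(X_train, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean_y))
    mu = mean_y + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # Prior variance is k(x, x) = 1 for the Tanimoto kernel.
    var = np.clip(1.0 - (v * v).sum(axis=0), 1e-12, None)
    return mu, np.sqrt(var)


_erf = np.vectorize(math.erf)


def expected_improvement(mu, sigma, best):
    """Expected improvement for minimization (lower ddG assumed better)."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def run_validation_loop(X, y, n_init=5, n_iter=20, seed=0):
    """Split into a training set and a held-out pool, then acquire from the pool."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train = list(order[:n_init])
    pool = list(order[n_init:])
    best_trace = [y[train].min()]
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X[train], y[train], X[pool])
        ei = expected_improvement(mu, sigma, y[train].min())
        pick = pool.pop(int(np.argmax(ei)))  # move the chosen item out of the pool
        train.append(pick)
        best_trace.append(y[train].min())
    return np.array(best_trace), train


def make_toy_pool(n=200, d=64, seed=0):
    """Synthetic stand-in for a pre-computed pool; NOT data from the paper."""
    rng = np.random.default_rng(seed)
    X = (rng.random((n, d)) < 0.5).astype(float)
    w = rng.normal(size=d)
    y = X @ w / np.sqrt(d) + 0.05 * rng.normal(size=n)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_pool()
    trace, _ = run_validation_loop(X, y)
    print("best value per iteration:", np.round(trace, 3))
    print("best value in pool:", round(float(y.min()), 3))
```

The full loop of Figure 1 differs mainly in the acquisition step: instead of picking from a fixed pool, it proposes a previously unseen sequence and calls the simulator to obtain its $\Delta\Delta G$ value before updating the surrogate model.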
