Stable Online and Offline Reinforcement Learning for Antibody CDRH3 Design

Yannick Vogt; Mehdi Naouar; Maria Kalweit; Christoph Cornelius Miething; Justus Duyster; Roland Mertelsmann; Gabriel Kalweit; Joschka Boedecker

TL;DR

The paper tackles the antibody design problem for the CDRH3 region under a massive search space of $20^L$ sequences with $L=11$, proposing a reinforcement learning framework usable in both online and offline settings. It introduces an offline-capable, stable RL approach that combines Maxmin ensembles and an attention-based Q-network, along with Fitness Buffer replay and nonlinear reward scaling to address overestimation and epistasis. The method achieves state-of-the-art binding energies on the Absolut! benchmark across eight antigens in both online and offline evaluations, demonstrating robust convergence and data-efficient learning from pre-collected datasets. This work enables practical antibody design with pre-existing data and paves the way for antigen-specific design by extending the modeling of biophysical properties.
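To make two of the quantities above concrete, the following is a minimal sketch, not the authors' implementation. It computes the size of the $20^L$ search space for $L=11$ and shows the generic Maxmin Q-learning target, in which the minimum over ensemble members is taken per action before the greedy maximum. The function name `maxmin_target`, the discount `gamma`, and the list-of-lists layout for the ensemble outputs are assumptions made for illustration; the paper's network architecture, Fitness Buffer, and reward scaling are not reproduced here.

```python
# Illustrative sketch only; names and default values are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
CDRH3_LENGTH = 11                     # L = 11, as stated in the summary

# Size of the sequence search space: 20^L.
search_space_size = len(AMINO_ACIDS) ** CDRH3_LENGTH


def maxmin_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Generic Maxmin Q-learning target for a single transition.

    next_q_values holds one list of per-action Q-estimates for each
    ensemble member, all evaluated at the next state.
    """
    if terminal:
        # No bootstrapping from a terminal state.
        return reward
    num_actions = len(next_q_values[0])
    # Per-action minimum over the ensemble: a pessimistic estimate that
    # counteracts the overestimation bias of a single Q-network.
    min_q = [min(member[a] for member in next_q_values)
             for a in range(num_actions)]
    # Greedy maximum over the pessimistic per-action estimates.
    return reward + gamma * max(min_q)
```

With two ensemble members that disagree on which action is best, for example `[[1.0, 3.0], [2.0, 0.5]]`, the per-action minimum is `[1.0, 0.5]`, so the optimistic estimate of 3.0 from the first member does not enter the target.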

Abstract

The field of antibody-based therapeutics has grown significantly in recent years, with targeted antibodies emerging as a potentially effective approach to personalized therapies. Such therapies could be particularly beneficial for complex, highly individual diseases such as cancer. However, progress in this field is often constrained by the extensive search space of amino acid sequences that form the foundation of antibody design. In this study, we introduce a novel reinforcement learning method specifically tailored to address the unique challenges of this domain. We demonstrate that our method can learn the design of high-affinity antibodies against multiple targets in silico, utilizing either online interaction or offline datasets. To the best of our knowledge, our approach is the first of its kind and outperforms existing methods on all tested antigens in the Absolut! database.

Paper Structure (20 sections, 6 figures, 1 table, 1 algorithm)

Introduction
Background
Markov Decision Process
Online and Offline Reinforcement Learning
Stable Reinforcement Learning for Antibody Design
Favor Exploration -- Replay the Fittest
Stabilizing Reinforcement Learning with Ensembles
Attention-based Q-Networks
Reward Scaling for Binding Energy
Results and Discussion
Online Learning
CDRH3 Analysis
Offline Learning
Conclusion
Architecture
...and 5 more sections

Figures (6)

Figure 1: Visualization of our method on fictitious CDRH3 sequences of length four. Our method repeatedly designs CDRH3 sequences (1), evaluates them using the Absolut! software (2), stores the gathered data (3), and updates its Q-function and thereby its policy (4). We highlight important components of our method in orange.
Figure 2: Progress of the binding energy of our method on different antigens, colored as displayed in the provided legend. The left plot depicts the online setting, while the offline setting is shown in the right plot. Mean and two standard deviations over eight seeds are shown.
Figure 3: Amino acid frequency per position in the top 0.01% of CDRH3 sequences in the Absolut! database (left) and in CDRH3 sequences discovered in our experiments that reach at least the same affinity (right). Colors indicate amino acid polarity.
Figure 4: Visual summary of our network architecture.
Figure 5: Progress of energy scores of our method and different ablation runs on the antigen 2DD8_S. Mean and two standard deviations over eight seeds are displayed. "Ours" represents our method with all its components. We refer to our method without a high rr, Fitness Buffer, and Scaling as "w/o all", as that reflects our method without all its components. On the left, we visualize the convergence of our method in comparison to DQN, both with and without all our components. On the right, we visualize the effect of removing a subset of the components.
...and 1 more figure
