AntibodyFlow: Normalizing Flow Model for Designing Antibody Complementarity-Determining Regions

Bohao Xu, Yanbo Wang, Wenyu Chen, Shimin Shan

TL;DR

AntibodyFlow addresses the challenge of designing 3D antibody CDR loops by representing a loop as a distance matrix $\mathbf{D}$ and an amino-acid sequence $\mathbf{S}$, and modeling their joint distribution with a two-phase normalizing flow: $f_{\mathbf{D}}$ generates $\mathbf{D}$, and $f_{\mathbf{S}|\mathbf{D}}$ generates $\mathbf{S}$ conditioned on $\mathbf{D}$. A differentiable constraint-learning component enforces bond-length and open-loop validity, while a constrained coordinate-generation step reconstructs 3D coordinates $\mathbf{G}$ from $\mathbf{D}$ under these geometric constraints. Empirical results on SAbDab and CoV-AbDab show that AntibodyFlow achieves higher validity rates (VR) and lower RMSD than baselines, with up to a 16.0% relative VR improvement and a 24.3% relative RMSD reduction, and that it yields better SARS-CoV-2 neutralization predictions. Overall, the work shows that combining distance-based geometric priors, conditional sequence generation, and geometry-aware optimization can substantially advance de novo antibody design, with practical therapeutic implications.

Abstract

Therapeutic antibodies have been extensively studied in drug discovery and development over the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-and-key manner. The binding affinity between an antibody and a specific antigen is largely determined by the complementarity-determining regions (CDRs) of the antibody. Existing machine learning methods cast the in silico development of CDRs as either sequence generation or 3D graph generation (with a single chain) and have achieved initial success. However, because CDR loops have specific geometric shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model for designing antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. AntibodyFlow also performs constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow consistently outperforms the best baseline, with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph-level error (root mean square deviation, RMSD).

Paper Structure (22 sections, 1 theorem, 25 equations, 6 figures, 3 tables, 1 algorithm)

Key Result

Corollary 1

Suppose $\mathbf{G}'$ are the 3D coordinates obtained by rotating and translating the original graph, whose coordinates are $\mathbf{G}$, and let $\mathbf{D}$ and $\mathbf{D}'$ be the distance matrices of $\mathbf{G}$ and $\mathbf{G}'$, respectively. Then $\mathbf{D} = \mathbf{D}'$. The proof is g
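The invariance stated in Corollary 1 can be checked numerically. The following is a minimal sketch, not the authors' code: the four coordinates are hypothetical placeholders, the helper names are invented for illustration, and the rigid motion is restricted to a rotation about the z-axis followed by a translation (a special case of the general rotations the corollary covers).

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_then_translate(coords, angle, shift):
    """Rotate each point about the z-axis by `angle` radians, then add `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    moved = []
    for x, y, z in coords:
        moved.append((c * x - s * y + shift[0],
                      s * x + c * y + shift[1],
                      z + shift[2]))
    return moved


# Hypothetical coordinates G (not taken from any real antibody structure).
G = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.5, 0.4), (2.2, 5.9, 1.7)]

# G' is G after a rigid motion: rotation, then translation.
G_prime = rotate_z_then_translate(G, angle=0.7, shift=(10.0, -4.0, 2.5))

D = distance_matrix(G)
D_prime = distance_matrix(G_prime)

# Largest entrywise difference between D and D'; nonzero only through
# floating-point rounding.
max_diff = max(abs(a - b)
               for row, row_p in zip(D, D_prime)
               for a, b in zip(row, row_p))
print(max_diff < 1e-9)
```

Because $\mathbf{D}$ does not change under such motions, a model that works on $\mathbf{D}$ does not need to learn that rotated or shifted copies of a loop are the same structure.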

Figures (6)

  • Figure 1: Data representation. An antibody is a special kind of protein with a symmetric Y shape. Each half of the symmetric unit has two chains: a heavy chain (H) and a light chain (L). In total, there are four chains: two identical H chains and two identical L chains. The majority of the binding affinity (to specific antigens) is modulated by a set of binding loops called the Complementarity Determining Regions (CDRs), found on the variable domain of each of the H and L chains. There are 6 CDR loops on each half of the antibody: L1, L2, L3 on the light chain, and H1, H2, H3 on the heavy chain. We show the H3 loop of the antibody whose Protein Data Bank (PDB) ID is "5iwl" as an example. The CDR loop in the 3D geometry graph contains 7 amino acids (GGYRAMD) and their coordinates. The sequence is represented as a binary matrix $\mathbf{S} \in \{0,1\}^{7\times 20}$ (20 natural amino acids; each row is a one-hot vector), and the geometry information is represented as a pairwise distance matrix $\mathbf{D}\in\mathbb{R}^{7\times 7}_{+}$.
  • Figure 2: Visualization of interpolation between two CDR H3 loops. For both the 3D geometric structure and amino acid sequence, the changing trajectories are smooth.
  • Figure 3: The whole framework of AntibodyFlow. (Forward) Flow model $f$ incorporates (i) $\mathbf{z}_{\mathbf{D}} = f_{\mathbf{D}}(\mathbf{D})$ (Sec. \ref{sec:distance}) and (ii) $\mathbf{z}_{\mathbf{S}|\mathbf{D}} = f_{\mathbf{S}|\mathbf{D}}(\mathbf{S}|\mathbf{D})$ (Sec. \ref{sec:amino}); Inverse flow model $f^{-1}$ incorporates (i) $\mathbf{D} = f_{\mathbf{D}}^{-1}(\mathbf{z}_{\mathbf{D}})$ (Sec. \ref{sec:distance}) and (ii) $\mathbf{S} = f_{\mathbf{S}|\mathbf{D}}^{-1}(\mathbf{z}_{\mathbf{S}|\mathbf{D}}; \mathbf{z}_{\mathbf{D}})$ (Sec. \ref{sec:amino}); "Constraint learning" (Sec. \ref{sec:constraint}) minimizes the constraint loss (Eq. \ref{eqn:constraint_loss}) to encourage the flow model to learn these constraints. "Constrained 3D coordinates generation" (Sec. \ref{sec:coordinate_generation}) generates 3D coordinates based on the distance matrix and the validity constraints of CDR loops defined in Eq. \ref{eqn:bond_length} and Eq. \ref{eqn:open_loop}.
  • Figure 4: An illustration of the H function (Eq. \ref{eqn:huber}), where $a=-1$, $b=1$, $\delta=1$. Adjacent sections have equal values and slopes at the four connection points ($-2, -1, 1, 2$), so it is differentiable everywhere.
  • Figure 5: Sensitivity Analysis on $\lambda_1$ and $\lambda_2$. We find the combination $\lambda_1=50$ and $\lambda_2=100$ works best empirically.
  • ...and 1 more figure
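The data representation described in Figure 1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' preprocessing code: the column order of the 20 amino acids (`AMINO_ACIDS`) is an assumed alphabetical convention, the helper names are invented here, and the coordinates are synthetic placeholders rather than the real coordinates of the 5iwl H3 loop. Only the sequence GGYRAMD and the matrix shapes come from the caption.

```python
import math

# Assumed column order for the 20 natural amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a sequence as a binary matrix with one one-hot row per residue."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(aa)] = 1
        rows.append(row)
    return rows


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


sequence = "GGYRAMD"  # H3 loop sequence quoted in the Figure 1 caption
S = one_hot(sequence)  # 7 x 20 binary matrix

# Placeholder coordinates on a spiral, one point per residue (hypothetical).
coords = [(3.0 * math.cos(0.9 * i), 3.0 * math.sin(0.9 * i), 1.5 * i)
          for i in range(len(sequence))]
D = distance_matrix(coords)  # 7 x 7 matrix of non-negative distances

print(len(S), len(S[0]), len(D), len(D[0]))
```

The pair $(\mathbf{S}, \mathbf{D})$ built this way has the shapes given in the caption: one row of $\mathbf{S}$ per residue, and a symmetric $\mathbf{D}$ with a zero diagonal.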

Theorems & Definitions (2)

  • Corollary 1
  • proof