Decoupled Sequence and Structure Generation for Realistic Antibody Design

Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park

TL;DR

ASSD introduces a sequence-structure decoupling framework for antibody design by factorizing the joint design into a sequence design model $p_\theta(\mathbf{s}|\mathbf{c})$ and a structure predictor $p_\phi(\mathbf{x}|\mathbf{s},\mathbf{c})$. A composition-based objective with REINFORCE mitigates excessive token repetition in non-autoregressive sequence generation, enabling efficient training on large sequence databases via a protein language model (ESM2-650M) with LoRA fine-tuning. Across SAbDab, RAbD, affinity optimization, and docked-template scenarios, ASSD achieves superior or competitive amino-acid recovery (AAR) and structural accuracy (RMSD/TM-score) while substantially reducing token repetition ($p_{\text{rep}}$). The approach generalizes to protein design and is robust to data leakage, suggesting a practical, scalable path for realistic antibody and protein design with architecture- and task-specific components.
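
Written out, the factorization named above takes the following form. This is a sketch inferred from the notation in this summary, with $\mathbf{s}$ the CDR sequence, $\mathbf{x}$ its structure, and $\mathbf{c}$ the conditioning context; the paper's own equation may differ in detail.

$$
p_{\theta,\phi}(\mathbf{s}, \mathbf{x} \mid \mathbf{c}) = p_\theta(\mathbf{s} \mid \mathbf{c}) \, p_\phi(\mathbf{x} \mid \mathbf{s}, \mathbf{c})
$$

Because the two factors have separate parameters, the sequence model $p_\theta$ can be trained on sequence-only data while the structure predictor $p_\phi$ is chosen independently.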

Abstract

Recently, deep learning has made rapid progress in antibody design, which plays a key role in the advancement of therapeutics. A dominant paradigm is to train a model to jointly generate the antibody sequence and the structure as a candidate. However, the joint generation requires the model to generate both the discrete amino acid categories and the continuous 3D coordinates; this limits the space of possible architectures and may lead to suboptimal performance. In response, we propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation and structure prediction. Although our approach is simple, our idea allows the use of powerful neural architectures and demonstrates notable performance improvements. We also find that the widely used non-autoregressive generators promote sequences with overly repeating tokens. Such sequences are both out-of-distribution and prone to undesirable developability properties that can trigger harmful immune responses in patients. To resolve this, we introduce a composition-based objective that allows an efficient trade-off between high performance and low token repetition. ASSD shows improved performance in various antibody design experiments, while the composition-based objective successfully mitigates token repetition of non-autoregressive models.
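
To make the token repetition issue concrete, the sketch below scores a sequence by the fraction of adjacent positions that carry the same residue. This is an illustrative assumption, not necessarily the paper's definition of $p_{\text{rep}}$; the function name `repetition_rate` and both example sequences are hypothetical.

```python
def repetition_rate(seq: str) -> float:
    """Fraction of adjacent residue pairs that are identical.

    Assumed, illustrative metric: the paper's p_rep may be defined differently.
    """
    if len(seq) < 2:
        return 0.0
    # Count positions whose residue equals the one immediately before it.
    repeats = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return repeats / (len(seq) - 1)


# Hypothetical CDR-H3-like sequences, for illustration only.
natural_like = "ARDGYSSGWYFDV"
repetitive = "AR" + "Y" * 8 + "FDV"

print(f"natural-like: {repetition_rate(natural_like):.3f}")
print(f"repetitive:   {repetition_rate(repetitive):.3f}")
```

A generator that collapses onto a single residue across a loop would score much higher under such a measure than a natural-looking sequence of the same length.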

Paper Structure

This paper contains 25 sections, 10 equations, 6 figures, 14 tables, 3 algorithms.

Figures (6)

  • Figure 1: Schematic structure of an antibody. An antibody consists of a pair of heavy and light chains, each containing a variable region and constant regions. The variable region consists of three complementarity-determining regions (CDRs) and their complement, the framework regions. We aim to design the heavy-chain CDRs, which contribute the most to antibody-antigen interaction.
  • Figure 2: Overview of the antibody sequence-structure decoupling (ASSD) framework. ASSD first designs CDR sequences with a sequence design model and then predicts the corresponding structure with a sequence-to-structure model. $\bm{s}, \bm{x}$ are the ground-truth CDR sequence and structure, and $\bm{\hat{s}}, \bm{\hat{x}}$ are the predicted CDR sequence and structure. $\bm{c}$ denotes conditional information, e.g., framework region, antigen, and sequence/structure initialization.
  • Figure 3: Example illustrating the token repetition problem.
  • Figure 4: Effect of $\alpha$ on AAR and $p_{\text{rep}}$. The color bar represents the value of $\alpha$ in the MLE-RL objective. (a) Results for CDR-H3 on the SAbDab benchmark. The ASSD approach reaches the Pareto frontier relative to the non-autoregressive baselines. (b) Results for the CATH 4.2 (left) and CATH 4.3 (right) benchmarks. LM-Design trained with our MLE-RL objective maintains performance comparable to ProteinMPNN while significantly reducing $p_{\text{rep}}$.
  • Figure 5: Sequence dataset for the hypothetical example in the methods overview section.
  • ...and 1 more figure