Decoupled Sequence and Structure Generation for Realistic Antibody Design
Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park
TL;DR
ASSD decouples antibody design by factorizing the joint sequence-structure distribution into a sequence design model $p_\theta(\mathbf{s}|\mathbf{c})$ and a structure predictor $p_\phi(\mathbf{x}|\mathbf{s},\mathbf{c})$. Because the sequence model no longer has to generate coordinates, it can be built on a protein language model (ESM2-650M) with LoRA fine-tuning and trained efficiently on large sequence databases. A composition-based objective, optimized with REINFORCE, mitigates excessive token repetition in non-autoregressive sequence generation. Across SAbDab, RAbD, affinity optimization, and docked-template scenarios, ASSD achieves superior or competitive amino-acid recovery (AAR) and structural accuracy (RMSD/TM-score) while substantially reducing token repetition ($p_{\text{rep}}$). The approach generalizes to protein design and is robust to data leakage, suggesting a practical, scalable path for realistic antibody and protein design in which the sequence and structure components can each use an architecture suited to its task.
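Written out, the factorization named above is the chain rule applied to the joint distribution over sequence $\mathbf{s}$ and structure $\mathbf{x}$. The reading of $\mathbf{c}$ as the conditioning context is an assumption, since its exact contents are not specified here:

$$
p_{\theta,\phi}(\mathbf{s}, \mathbf{x} | \mathbf{c}) = p_\theta(\mathbf{s} | \mathbf{c}) \, p_\phi(\mathbf{x} | \mathbf{s}, \mathbf{c})
$$

Under this form, a candidate can be drawn in two stages: first a sequence $\mathbf{s} \sim p_\theta(\cdot | \mathbf{c})$, then a structure $\mathbf{x} \sim p_\phi(\cdot | \mathbf{s}, \mathbf{c})$, so the two models can be parameterized and trained separately.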
Abstract
Deep learning has recently made rapid progress in antibody design, which plays a key role in advancing therapeutics. A dominant paradigm is to train a model that jointly generates the antibody sequence and structure as a candidate. However, joint generation requires the model to produce both discrete amino acid categories and continuous 3D coordinates; this limits the space of possible architectures and may lead to suboptimal performance. In response, we propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation from structure prediction. Although simple, this approach allows the use of powerful neural architectures and yields notable performance improvements. We also find that widely used non-autoregressive generators promote sequences with excessively repeated tokens. Such sequences are both out-of-distribution and prone to undesirable developability properties that can trigger harmful immune responses in patients. To resolve this, we introduce a composition-based objective that allows an efficient trade-off between high performance and low token repetition. ASSD shows improved performance across various antibody design experiments, and the composition-based objective successfully mitigates the token repetition of non-autoregressive models.
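To make the notion of token repetition concrete, here is a minimal sketch of one plausible way to quantify it: the fraction of adjacent residue pairs that are identical. The function name `repetition_rate` and this particular formula are illustrative assumptions; the paper's exact definition of its repetition metric is not given in this summary and may differ.

```python
def repetition_rate(seq: str) -> float:
    """Fraction of adjacent residue pairs in `seq` that are identical.

    Hypothetical helper for illustration only; not necessarily the
    repetition metric used in the paper.
    """
    if len(seq) < 2:
        # A sequence with fewer than two residues has no adjacent pairs.
        return 0.0
    # Count positions where a residue equals the one immediately before it.
    repeats = sum(1 for prev, cur in zip(seq, seq[1:]) if prev == cur)
    return repeats / (len(seq) - 1)


if __name__ == "__main__":
    # Made-up example strings, not data from the paper.
    print(repetition_rate("ARDGSWFAY"))
    print(repetition_rate("ARDYYYYYGMDV"))
```

A sequence with a long run of one residue scores higher under this measure than one with varied residues, which is the kind of degenerate output the composition-based objective is meant to discourage.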
