S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu

TL;DR

This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), which combines holistic sequential and structural information in one unified, generic antibody foundation model. It outperforms well-established baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks.

Abstract

Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown great potential to interpret complex biological structures and functions. However, existing antibody-specific models share a notable limitation: they lack explicit consideration of antibody structural information, even though both the 1D sequence and the 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporating two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained on 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying crucial antibody binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions it to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed hierarchical pre-training paradigm containing two stages. a, In stage I, S$^2$ALM aims at general sequence-structure learning with protein sequences and structures. In stage II, S$^2$ALM learns antibody-specific multi-level knowledge using antibody sequences and structures. b, Masked Language Modeling (MLM) reconstructs the masked tokens based on the contextualized information. c, Sequence-Structure Matching (SSM) identifies the matching relationships between 1D and 3Di sequences. d, Cross-Level Reconstruction (CLR) reconstructs the corrupted tokens based on hybrid information from both 1D and 3Di sequences.
  • Figure 2: Illustrations of compositional ratios of the pre-training data and the structural encoding protocol. a, The protein data contains three parts: sequences, experimentally-determined structures, and computationally-predicted structures. b, The antibody data contains four parts: sequences, experimentally-determined structures, and computationally-predicted structures from ABodyBuilder2 and from IgFold. c, Efficient encoding protocol of protein 3D structures. Foldseek is employed to discretize the target 3D structure into a 3Di sequence, whose 3Di states describe the tertiary interaction between a residue and its nearest neighbor.
  • Figure 3: The t-SNE visualization results. Different colors indicate different antibody categories. Untrained S$^2$ALM and pre-trained ESM-2 are included for comparison. The visualization analyses demonstrate that the representations encoded by S$^2$ALM contain information about functional specificity, biological species and evolutionary isotypes.
  • Figure 4: S$^2$ALM exhibits superior performance on antibody understanding and generation tasks. a, Interpretability analysis of S$^2$ALM in capturing antibody structural interaction patterns. The heatmap shows the self-attention values of the STE90-C11 heavy chain, derived from the 3rd head of the last hidden layer in S$^2$ALM. The crystal structure of STE90-C11 (PDB: 7B3O) confirms the interaction mediated by the hydrogen bond between TRP157 and SER183. b, Evaluation results of the generated antibodies on the antibody CDR design task. S$^2$ALM simultaneously balances the generative PPL, AAR and DIV. c-d, Experimental performance on antigen binding capability prediction, B cell maturation analysis and antibody paratope prediction tasks. (Antibody paratope prediction datasets are from EATLM in c and AntiBERTa in d.) S$^2$ALM consistently achieves state-of-the-art performance across all included evaluation metrics compared to all baseline models. e, Structural evaluation results of the generated antibodies. AlphaFold3 is employed to predict 3D structures. S$^2$ALM surpasses other baseline models in terms of pLDDT, pTM and ipTM. f, 3D structure visualization of the generated complexes. Targeting three specific pathogens (i.e., Vaccinia virus, Neisseria meningitidis and Influenza B virus), S$^2$ALM is employed to design the antibody CDR-H3 (highlighted in red). The stable and regular 3D structures of the designed antigen-antibody complexes demonstrate the superiority of S$^2$ALM.
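
As a rough illustration of the three objectives named in the Figure 1 caption (MLM, SSM and CLR), the following minimal Python sketch shows how training examples for each could be assembled from a 1D amino-acid sequence and its 3Di sequence. It is not the authors' implementation. The function names, the `<mask>` and `<sep>` tokens, the 15% default corruption rate, the rotation-based negative sampling and the concatenated CLR input layout are all assumptions made for this example, and the sequences are toy strings rather than real Foldseek output.

```python
import random

MASK = "<mask>"  # hypothetical mask token
SEP = "<sep>"    # hypothetical separator between the 1D and 3Di levels


def mask_tokens(tokens, rate=0.15, rng=None):
    """MLM-style corruption: replace a random subset of tokens with MASK.

    Returns (corrupted, targets), where targets[i] holds the original token
    at masked positions and None everywhere else. The rate is an assumption.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = [t if i in positions else None for i, t in enumerate(tokens)]
    return corrupted, targets


def make_ssm_pairs(seqs, structs):
    """SSM-style pairs: label 1 for a true (sequence, 3Di) pair, 0 otherwise.

    Negatives are built by rotating the structure list by one position, which
    assumes at least two entries with distinct structures.
    """
    positives = [(s, t, 1) for s, t in zip(seqs, structs)]
    rotated = structs[1:] + structs[:1]
    negatives = [(s, t, 0) for s, t in zip(seqs, rotated)]
    return positives + negatives


def make_clr_input(seq_tokens, struct_tokens, rate=0.15, rng=None):
    """CLR-style input: corrupted 1D tokens followed by the intact 3Di tokens.

    Only the 1D side is corrupted here; the layout is an assumption.
    """
    corrupted, targets = mask_tokens(seq_tokens, rate, rng)
    model_input = corrupted + [SEP] + list(struct_tokens)
    labels = targets + [None] * (1 + len(struct_tokens))
    return model_input, labels


if __name__ == "__main__":
    seq = list("EVQLVESGGGLVQPGG")     # toy heavy-chain fragment
    struct = list("dvvqdkdaddddppvd")  # toy 3Di string, not real Foldseek output
    print(mask_tokens(seq, rate=0.25))
    print(make_ssm_pairs(["AAA", "CCC"], ["ddd", "vvv"]))
    print(make_clr_input(seq, struct, rate=0.25))
```

In this sketch a model would be trained to predict the non-`None` labels (MLM and CLR) or the binary label (SSM). In the CLR example the uncorrupted 3Di tokens provide the cross-level context for reconstructing the masked residues.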