Large scale paired antibody language models

Henry Kenlay, Frédéric A. Dreyer, Aleksandr Kovaltsuk, Dom Miketa, Douglas Pires, Charlotte M. Deane

TL;DR

This work introduces IgBert and IgT5, large-scale antibody-specific language models trained on over $2\times 10^{9}$ unpaired and $2\times 10^{6}$ paired heavy/light sequences from the Observed Antibody Space (OAS). By first pre-training on unpaired data and then fine-tuning on paired sequences, the authors enable cross-chain feature learning that improves sequence recovery and binding-affinity predictions, while showcasing the complementary strengths of antibody-specific versus general protein models in related tasks. Key findings include superior performance of paired models on binding energy prediction and sequence recovery, with general models sometimes better for expression, and a notable reduction in pseudo-perplexity, indicating more accurate sequence modeling. The work demonstrates the practical potential of large-scale, paired antibody LMs for therapeutic design and provides publicly available models to accelerate antibody engineering workflows, while outlining future enhancements through integration with structure, broader pretraining data, and generative capabilities.

Abstract

Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.

Paper Structure (18 sections, 4 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Overview of an antibody structure and its domains. The sequence of the variable region is used as input to the transformer encoder to obtain a residue-level embedding representation. Training is achieved through masked language modelling, where a random fraction of the input is replaced by mask tokens.
  • Figure 2: Data processing and training strategy. We further pre-train the ProtT5 and ProtBert models from ProtTrans on unpaired antibody sequences from OAS after clustering them with Linclust. These unpaired models are then fine-tuned on paired sequences clustered with MMseqs2, combining the two variable region chains into a single input with a separator token.
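
The two captions above describe an input format (heavy and light variable regions joined by a separator token) and a training objective (masked language modelling over a random fraction of the input). The following is a minimal sketch of that data preparation in plain Python, not the authors' implementation. The token names `[SEP]` and `[MASK]`, the 15% masking rate, the helper names `pair_chains` and `mask_tokens`, and the example sequence fragments are all assumptions made for illustration; the real models use their own tokenizers and training settings.

```python
import random

MASK = "[MASK]"  # assumed mask token name
SEP = "[SEP]"    # assumed separator token name


def pair_chains(heavy, light):
    """Join heavy and light variable-region sequences into one token list,
    one token per residue, with a separator token between the chains."""
    return list(heavy) + [SEP] + list(light)


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace a random fraction of residue tokens with the mask token.

    The separator is never masked. Returns the masked token list and a
    dict mapping each masked position to the residue it originally held,
    which is what a masked language model would be trained to predict.
    The 15% default rate is an assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(tokens) if tok != SEP]
    n_mask = max(1, round(mask_rate * len(candidates)))
    positions = sorted(rng.sample(candidates, n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


# Made-up sequence fragments, for illustration only.
heavy = "EVQLVESGGGLVQPGG"
light = "DIQMTQSPSSLSASVG"

tokens = pair_chains(heavy, light)
masked, targets = mask_tokens(tokens, seed=0)
```

Because both chains sit in one input, a prediction for a masked residue in one chain can draw on residues from the other, which is the cross-chain feature learning described in the TL;DR.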