Improving Antibody Humanness Prediction using Patent Data

Talip Ucar, Aubin Ramon, Dino Oglic, Rebecca Croasdale-Wood, Tom Diethe, Pietro Sormanni

TL;DR

This paper tackles the challenge of predicting antibody humanness by leveraging noisy patent-derived data through a two-stage learning framework called SelfPAD. It first performs weakly-supervised contrastive pre-training on the Patented Antibody Database to learn context-aware amino-acid embeddings, then fine-tunes the Transformer-based encoder together with an MLP to predict humanness using a cross-entropy loss. Empirically, SelfPAD is compared against Hu-mAb, OASis, and AbNatiV across six inference tasks and achieves state-of-the-art results on five of them, underscoring the value of patent-sourced representations for immunogenicity prediction. The work highlights both the practical potential and the limitations of patent data for antibody developability assessments and points to integrating additional data sources to further improve robustness and coverage.

Abstract

We investigate the potential of patent data for improving antibody humanness prediction using a multi-stage, multi-loss training process. Humanness serves as a proxy for the immunogenic response to antibody therapeutics, one of the major causes of attrition in drug discovery and a challenging obstacle for their use in clinical settings. We pose the initial learning stage as a weakly-supervised contrastive-learning problem, where each antibody sequence is associated with possibly multiple identifiers of function and the objective is to learn an encoder that groups sequences according to their patented properties. We then freeze a part of the contrastive encoder and continue training it on the patent data using the cross-entropy loss to predict the humanness score of a given antibody sequence. We illustrate the utility of the patent data and our approach by performing inference on three different immunogenicity datasets, unseen during training. Our empirical results demonstrate that the learned model consistently outperforms the alternative baselines and establishes a new state of the art on five out of six inference tasks, irrespective of the metric used.
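
The weakly-supervised contrastive stage described above can be illustrated with a minimal sketch. It assumes a SupCon-style objective in which two sequences form a positive pair when they share at least one function identifier; the exact loss used in the paper is not given in this section, so the function name `weak_contrastive_loss`, the temperature value, the toy 2-D embeddings, and the target labels below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weak_contrastive_loss(embeddings, targets, temperature=0.1):
    """SupCon-style loss (an assumption, not the paper's stated loss).

    embeddings: list of vectors, one per antibody sequence.
    targets: list of sets of function identifiers, one set per sequence.
    A pair (i, j) is positive when the two label sets overlap.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and targets[i] & targets[j]]
        if not positives:
            continue  # anchors without any positive are skipped
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other sequences.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical labels: sequence 1 carries two identifiers (multi-label case).
targets = [{"HER2"}, {"HER2", "EGFR"}, {"TNF"}, {"TNF"}]
grouped = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # same-target pairs close
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]    # same-target pairs far apart

loss_grouped = weak_contrastive_loss(grouped, targets)
loss_mixed = weak_contrastive_loss(mixed, targets)
```

In this toy setting the loss is lower when sequences sharing a target sit close together in embedding space, which is the grouping behaviour the pre-training stage is meant to encourage.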

Paper Structure (11 sections, 3 equations, 6 figures, 10 tables)

This paper contains 11 sections, 3 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: SelfPAD framework. a) Pre-training: the training process considers sequences associated with the same potential targets as positive pairs. b) Fine-tuning: the architecture of the transformer-based sequence encoder used in pre-training, as well as the MLP, swap noise and shift noise used only during fine-tuning.
  • Figure 2: After pre-training: t-SNE visualisation of antibodies in the test set based on their a) origin and b) potential targets (illustration for a small subset of targets). After fine-tuning the model to predict humanness: c) t-SNE plot of heavy, kappa, lambda and VHH chains from different species (H denotes human and O denotes heavy chains from other species).
  • Figure 3: Human versus Nonhuman sequences: PCA of embeddings from FT-SelfPAD for a) the test set of PAD, b) 553 therapeutics data from OASis, c) 217 immunogenicity data from OASis, d) 25 humanization data from Hu-mAb, where 25 therapeutic antibodies with the immunogenicity problem (i.e., parental sequence) as well as their humanized versions were plotted together with other test fold sequences from the PAD (shown in blue).
  • Figure 4: Comparing SelfPAD-inferred (bottom) to experimental (top) positions in humanizing Certolizumab's heavy chain.
  • Figure 5: Training loss during pre-training of SelfPAD.
  • ...and 1 more figure