A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

Hirofumi Tsuruta; Hiroyuki Yamazaki; Ryota Maeda; Ryotaro Tamura; Akihiro Imura

TL;DR

Benchmark results with VHHBERT and existing protein and antibody language models confirm that AVIDa-SARS-CoV-2 is a valuable benchmark for evaluating how well antibody language models represent sequences for binding prediction, thereby facilitating the development of AI-driven antibody discovery.

Abstract

Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models from antibody sequences. However, the applicability of pre-trained language models to antibody discovery has not been thoroughly evaluated because labeled datasets are scarce. To overcome this limitation, we introduce AVIDa-SARS-CoV-2, a dataset of interactions between antigens and the variable domain of the heavy chain of heavy-chain antibodies (VHHs), obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT, which is pre-trained on VHHCorpus-2M, alongside existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides a valuable benchmark for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.

Paper Structure (45 sections, 1 equation, 5 figures, 6 tables)

Introduction
Related Work
Pre-trained Antibody Language Models.
Pre-training Datasets.
Evaluation Datasets.
AVIDa-SARS-CoV-2: Antigen-VHH Interaction Dataset Produced from Alpaca Immunized with SARS-CoV-2 Spike Proteins
Dataset Generation
Immunization
Affinity Selection
Data Labeling
Dataset Analysis
Binding Sensitivity to Sequence Variation
Individual Differences in Antigen-specific Antibody Production
Differences with AVIDa-hIL6
VHHCorpus-2M: VHH Sequence Corpus Produced from Alpaca
...and 30 more sections

Figures (5)

Figure 1: Overview of data generation process for AVIDa-SARS-CoV-2.
Figure 2: (a) Label visualization for each pair between 54 VHHs in three clusters and antigens. Each cell represents a unique VHH-antigen pair. White cells are unlabeled pairs that cannot be identified as "binder" or "non-binder" and are not included in AVIDa-SARS-CoV-2. (b)(c) Two-dimensional representation of binder sequences colored by individuals and clusters. The dataset analysis appendix provides enlarged versions of (b) and (c).
Figure 3: Distribution of pairwise identities of VHH sequences.
Figure 4: Overview of the experimental setup.
Figure 5: Two-dimensional representation of binder sequences colored by (a) individuals and (b) clusters.
