Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension

Mengren Liu, Yixiang Zhang, Yiming Zhang

TL;DR

This study investigates how protein language model architectures induce biases in understanding antibody sequences. By comparing AntiBERTa, BioBERT, ESM2, and a GPT-2 baseline on rhesus macaque heavy-chain fragments, it shows that specialized models naturally attend to biologically relevant regions (notably CDRs) and features (V gene usage, SHM, isotypes), while general models rely on explicit training strategies to uncover these signals. A key finding is that incorporating biological priors, such as CDR3-focused pooling, can significantly improve training efficiency and predictive performance for non-specialized models. The work provides guidance for designing PLMs tailored to antibody engineering and highlights how biology-guided training can accelerate discovery in computational immunology.

Abstract

Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models like AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), while general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
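The TL;DR mentions CDR3-focused pooling as one biology-guided strategy, but this summary does not specify how the authors implement it. As a minimal sketch under that caveat, the snippet below contrasts ordinary mean pooling over all residues with pooling restricted to a CDR3 span. The function names (`mean_pool`, `cdr3_focused_pool`) and the assumption that CDR3 boundaries are already known as residue indices are illustrative only and not taken from the paper.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average per-residue embeddings over the whole sequence."""
    emb = np.asarray(token_embeddings, dtype=float)
    if emb.ndim != 2:
        raise ValueError("expected an array of shape (sequence_length, hidden_dim)")
    return emb.mean(axis=0)


def cdr3_focused_pool(token_embeddings, cdr3_start, cdr3_end):
    """Average per-residue embeddings over the half-open span [cdr3_start, cdr3_end).

    Assumption: the CDR3 boundaries come from an external annotation step
    (hypothetical here) and are given as 0-based residue indices.
    """
    emb = np.asarray(token_embeddings, dtype=float)
    if emb.ndim != 2:
        raise ValueError("expected an array of shape (sequence_length, hidden_dim)")
    if not 0 <= cdr3_start < cdr3_end <= emb.shape[0]:
        raise ValueError("CDR3 span must be non-empty and lie within the sequence")
    return emb[cdr3_start:cdr3_end].mean(axis=0)


# Toy example: 4 residues with 3-dimensional embeddings; residues 2..3 play the role of CDR3.
toy_embeddings = np.arange(12, dtype=float).reshape(4, 3)
whole_sequence_vector = mean_pool(toy_embeddings)
cdr3_vector = cdr3_focused_pool(toy_embeddings, 2, 4)
```

Either pooled vector could then be passed to a downstream classifier. The restricted version simply discards framework positions, which is one plausible way to encode the prior that CDR3 carries most of the specificity signal.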

Paper Structure

This paper contains 8 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: A schematic illustration of the composition of an antibody. An antibody is composed of two heavy chains and two light chains. Both chains have a variable region that is unique to each B cell and a constant region (one of IgM, IgD, IgG, IgA, or IgE for heavy chains, and either IgK or IgL for light chains). The variable region is generated by the genetic combination of a V gene, a D gene, and a J gene. The complementarity-determining regions (CDRs) on the variable region are the most important segments that determine antigen specificity.
  • Figure 2: Graphical abstract of model architectural exploration. A. Encoding antibody sequences into latent embeddings using different encoders (e.g., ESM2, AntiBERTa, etc.). B. Adding antibody target labels to each embedded antibody sequence. C. Training classifiers of different architectures for each embedding to predict antigen binding.
  • Figure 3: The validation accuracy (in percent) of each classifier architecture paired with the corresponding PLM or GPT-2 model. Grouped bars are distinguished by hatch patterns and color: dark green (the transformer classifier), turquoise (the MLP classifier), and yellow (the FC classifier).
  • Figure 4: UMAP of output vectors under different language model and classifier model pairs. The output vector of each antibody sequence is colored by its target specificity: black (HIV+), purple (Pn3+), and beige (Pn3-).
  • Figure 5: UMAP of embedding vectors under different language models highlighted with various antibody-specific biological properties. A. The antibody target specificity. B. The V gene family of each antibody sequence. C. The somatic hypermutation of each antibody sequence. D. The antibody heavy chain (IgH) constant region isotypes.
  • ...and 3 more figures