Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension

Mengren Liu, Yixiang Zhang, Yiming Zhang

TL;DR

This study investigates how protein language model architectures induce biases in understanding antibody sequences. By comparing AntiBERTa, BioBERT, ESM2, and a GPT-2 baseline on rhesus macaque heavy-chain fragments, it shows that specialized models naturally attend to biologically relevant regions (notably CDRs) and features (V gene usage, SHM, isotypes), while general models rely on explicit training strategies to uncover these signals. A key finding is that incorporating biological priors, such as CDR3-focused pooling, can significantly improve training efficiency and predictive performance for non-specialized models. The work provides guidance for designing PLMs tailored to antibody engineering and highlights how biology-guided training can accelerate discovery in computational immunology.

Abstract

Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model baseline (GPT-2) on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models like AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), while general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
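
To make the idea of CDR3-focused pooling concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that per-residue embeddings have already been produced by some PLM and that the CDR3 boundaries are known as a half-open index range; the function names (`mean_pool`, `full_sequence_pool`, `cdr3_pool`) and the toy embeddings are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Sequence

Embedding = Sequence[float]


def mean_pool(token_embeddings: Sequence[Embedding],
              positions: Iterable[int]) -> List[float]:
    """Average the embedding vectors at the given residue positions."""
    rows = [token_embeddings[i] for i in positions]
    if not rows:
        raise ValueError("no positions selected for pooling")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def full_sequence_pool(token_embeddings: Sequence[Embedding]) -> List[float]:
    """Baseline: pool over every residue in the sequence."""
    return mean_pool(token_embeddings, range(len(token_embeddings)))


def cdr3_pool(token_embeddings: Sequence[Embedding],
              cdr3_start: int, cdr3_end: int) -> List[float]:
    """Pool only over residues in the half-open range [cdr3_start, cdr3_end).

    The boundaries are assumed to come from an external annotation step
    (hypothetical here); this function does not locate the CDR3 itself.
    """
    if not 0 <= cdr3_start < cdr3_end <= len(token_embeddings):
        raise ValueError("CDR3 range falls outside the sequence")
    return mean_pool(token_embeddings, range(cdr3_start, cdr3_end))


# Toy example: six residues with 2-dimensional embeddings (made-up values).
toy_embeddings = [
    [0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
    [6.0, 7.0], [8.0, 9.0], [10.0, 11.0],
]
whole = full_sequence_pool(toy_embeddings)   # uses all six residues
focused = cdr3_pool(toy_embeddings, 3, 6)    # uses residues 3, 4, 5 only
```

The pooled vector would then feed a downstream classifier. Restricting the average to CDR3 positions is one simple way to inject a biological prior into a model that does not attend to those regions on its own.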
