Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference

Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji

TL;DR

This work directly compares how the distribution of information stored across layers of attention heads differs between the protein and natural language domains. It opens up an area of research on how language models change behavior when moved into the protein domain, and it advances language modeling in biological domains.

Abstract

Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domains. Furthermore, we adapt a simple early-exit technique (originally used in the natural language domain to improve efficiency at the cost of performance) to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction. The technique allows the model to automatically select protein representations from the intermediate layers of the PLM for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The early-exit scheme, based on Schwartz et al. (2020). The input protein sequence is fed into the PLM. At each layer, an MLP makes a prediction for the downstream task, and the confidence of this prediction is calculated. When the confidence reaches a predetermined threshold, the model outputs the result from the current layer and ceases further execution.
  • Figure 2: Many PLMs display more variability in their attention focus than the corresponding NLM. The heat map displays how attention heads distribute their focus between positional and semantic information across 1,000 inputs, plotting each head for each input by its ratio of positional to semantic information focus. These plots are generated for the NLMs BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), the T5 encoder (Raffel et al., 2020), and XLNet (Yang et al., 2019), and for their corresponding PLMs (Elnaggar et al., 2022). The y-axis represents the ratio of positional:semantic information captured by the attention heads, and the color represents the number of attention heads in that layer per ratio bin. All attention heads, for each of the 1,000 inputs, are accounted for in each layer. As shown in the figure, the protein versions of BERT, ALBERT, and T5 display more variability in attention focus, with XLNet as an exception.
  • Figure 3: Early-Exit Improves both Performance and Efficiency in Non-Structural Tasks across Multiple PLMs. The total number of computed layers is used as a proxy for efficiency. The trade-offs between model performance and efficiency are calculated for: (1) Individual Layer Performance, (2) Early-Exit Last Layer Fallback, and (3) Early-Exit Most Confident Layer Fallback. The baseline performance of the last layer is drawn as a black line. Computations are done for ESM2 (Lin et al., 2023), ProtBERT, and ProtALBERT (Elnaggar et al., 2022). Early-Exit Most Confident Layer Fallback outperforms both the last-layer baseline and Early-Exit Last Layer Fallback in both performance and efficiency on non-structural tasks. For secondary structure prediction, early-exit allows efficiency gains but harms performance.
  • Figure 4: Walltimes versus total number of computed layers. The walltime for the testing set on 1 V100 GPU is plotted against the average number of computed layers across all models and tasks. Early-Exit Most Confident Layer Fallback is plotted. A diamond marker at the final layer denotes the baseline walltime. We see that walltime scales linearly with the number of computed layers.
  • Figure 5: Confidence calibration. A lower excess AURC score (Geifman et al., 2019) denotes a better-calibrated confidence metric. We see that, for all models, confidence is well calibrated across layers for EC, is well calibrated in middle and later layers for CL, and is poorly calibrated for GO tasks.
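
The early-exit scheme in Figure 1 and the two fallback variants named in Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `early_exit`, its arguments, and the use of the maximum softmax probability as the confidence score are all assumptions made for illustration. The per-layer logits stand in for the outputs of the per-layer MLP heads; passing them as a generator means later layers are never computed once a layer exits.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def early_exit(layer_logits, threshold, fallback="most_confident"):
    """Hypothetical early-exit loop (illustrative, not the paper's code).

    layer_logits: iterable yielding one 1-D logit array per layer, in order.
    threshold:    confidence needed to stop at a layer.
    fallback:     what to return if no layer reaches the threshold:
                  "last" (final layer) or "most_confident" (best layer seen).
    Returns (predicted_class, exit_layer_index, layers_computed).
    """
    if fallback not in ("most_confident", "last"):
        raise ValueError("fallback must be 'most_confident' or 'last'")

    best = None  # (prediction, layer, confidence) of the most confident layer
    last = None  # (prediction, layer) of the most recent layer
    computed = 0
    for layer, logits in enumerate(layer_logits):
        computed = layer + 1
        probs = softmax(np.asarray(logits, dtype=float))
        # Assumption: confidence is the maximum softmax probability.
        pred, conf = int(np.argmax(probs)), float(np.max(probs))
        if conf >= threshold:
            return pred, layer, computed  # exit early, skip remaining layers
        last = (pred, layer)
        if best is None or conf > best[2]:
            best = (pred, layer, conf)

    if last is None:
        raise ValueError("layer_logits must contain at least one layer")
    if fallback == "last":
        return last[0], last[1], computed
    return best[0], best[1], computed


# Toy logits for a 3-layer model and a 2-class task (made-up numbers).
toy_logits = [np.array([0.1, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 0.5])]
result_exit = early_exit(iter(toy_logits), threshold=0.8)
result_best = early_exit(iter(toy_logits), threshold=0.95)
result_last = early_exit(iter(toy_logits), threshold=0.95, fallback="last")
```

With the toy logits, a threshold of 0.8 stops at the second layer, so only two layers are computed. With a threshold of 0.95 no layer qualifies and all three layers run; the two fallbacks then return different layers, which is the distinction Figure 3 compares.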