ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

TL;DR

ProtTrans demonstrates that self-supervised, Transformer-based language models trained on massive protein sequence corpora can learn embeddings that encode structural and functional protein information without relying on evolutionary information (EI). Transferring these embeddings to lightweight supervised models yields competitive or superior performance on per-residue secondary structure prediction and on per-protein localization and membrane predictions, often matching state-of-the-art EI-based methods while enabling fast, proteome-scale inference. A key finding is that longer pretraining on large and diverse data yields richer representations, and the best-performing model (ProtT5-XL-U50) can match or outperform methods that take multiple sequence alignments (MSAs) as input for secondary structure prediction. Collectively, the work highlights a productive convergence of high-performance computing (HPC), NLP-inspired modeling, and computational biology, enabling rapid, scalable, MSA-free protein predictions.

Abstract

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second comprised per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions, the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.

Paper Structure

This paper contains 23 sections, 19 figures, 11 tables.

Figures (19)

  • Figure 1: Feature extraction overview - We give a general overview of how ProtTrans models can be used to derive features (embeddings) for arbitrary protein sequences, either on the level of single amino acids or whole proteins, and how they can be used for classification tasks on both levels. First, an example protein sequence "SEQ" is tokenized and positional encoding is added. The resulting vectors are passed through any of our ProtTrans models to create context-aware embeddings for each input token, i.e. each amino acid. Here, we used the last hidden state of the Transformer's attention stack as input for downstream prediction methods. Those embeddings can either be used directly as input for prediction tasks on the level of individual tokens, e.g. a CNN can be used to predict an amino acid's secondary structure. Alternatively, those embeddings can be concatenated and pooled along the length-dimension to get a fixed-size embedding irrespective of the input length, i.e., global average pooling is applied. The resulting protein-level embedding can be used as input for predicting aspects of proteins, e.g., an FNN can be used to predict a protein's cellular localization.
  • Figure 2: Large Scale Dataset Training: The figure shows the overhead of increasing the number of nodes/GPUs for both ProtTXL (blue; low) and ProtBert (red; high). The overhead increases slightly from 1 to 2 nodes but remains constant even when scaling up to 936 nodes with a total of 5616 GPUs. A low overhead means that the model scales up near-linearly across thousands of GPUs, upper-bounded by the theoretical scale-up.
  • Figure 3: Large Scale Deep Learning Training: The figures show the effect of enabling (red bars) or disabling (blue bars) large model support (LMS) on both model size and batch size when we tested ProtTXL or ProtBert on Nvidia V-100 16GB GPUs. It highlights the difference between applying LMS using PyTorch (ProtTXL) or TensorFlow (ProtBert). Panel (a) shows the effect of using LMS on the maximum model size that can fit in the memory of a single V-100. Panels (b,c) compare the effect of LMS on the maximum local (b) and global (c) batch size that can fit in the GPU. Panel (d) shows the number of hours required to finish a single epoch using 936 nodes, each with 6 GPUs, with LMS enabled.
  • Figure 4: Protein LMs learned constraints. t-SNE projections visualized information extracted by the unsupervised protein LMs (here best-performing ProtT5-U50; upper row: before training (Random), and lower row: after pre-training on BFD & UniRef50). (A) The left-most column highlights single amino acids by biophysical features. (B) The middle column annotates protein structural class (taken from SCOPe). (C) The right-most column distinguishes proteins according to the kingdom of life to which they are native. Although the random projections on top may suggest some adequacy of the clusters, the trained models shown in the lower row clearly stood out. Incidentally, the comparison of the two also highlighted the potential pitfalls of using t-SNE projections from many dimensions onto 2D: although the embeddings are random, a human observer might see some correct patterns even in the top row. Most impressive might be the fine-grained distinctions of biophysical features of amino acids (A); more surprising, however, are the classifications of entire proteins according to structural class (B) and organism (C). For these, we created embeddings through global average pooling over the representations extracted from the last layer of ProtT5-U50 (average over protein length, i.e. per-protein embeddings; Fig. 1).
  • Figure 5: Per-residue (token-level) performance for secondary structure prediction: CASP12 (red) and NEW364 (blue) constitute two test sets. Protein LMs trained here are shown in the left panel of the figure. The suffix BFD marks pre-training on the largest database, BFD; U50 marks pre-training with BFD and refining with UniRef50. We included protein LMs described elsewhere (marked as: protein LMs others, namely ESM-1b [72], DeepProtVec and DeepSeqVec [32]). All embeddings were input to the same CNN architecture. Two approaches used amino acids instead of embeddings as input (marked as: no LMs: DeepOneHot [32] - one-hot encoding - and DeepBLOSUM62 [32] - input BLOSUM62 [74] substitution matrix). We also compared to the current state-of-the-art (SOA) method NetSurfP-2.0 [15], as well as to Jpred4 [75], RaptorX [76, 77], and Spider3 [78]. The rightmost four methods use MSAs as input (marked as: MSA input evolutionary information). While only ProtT5-XL-U50 reached the SOA without using MSAs, several protein LMs outperformed other methods using MSAs. All protein LMs other than the context-free DeepProtVec improved significantly over methods using only amino acid information as input. One interpretation of the difference between the two data sets is that CASP12 provided a lower and NEW364 an upper limit. The top row shows the complete range from 0 to 100, while the lower row zooms into the range of differences relevant here.
  • ...and 14 more figures
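
The pooling step described for Figure 1 (averaging per-residue embeddings along the length dimension to obtain one fixed-size per-protein vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mean_pool` and the toy numbers are assumptions chosen for illustration, and the small hand-written matrix stands in for the last hidden state that a real ProtTrans model would produce.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors along the length dimension.

    Hypothetical helper: takes L vectors of size D (one per amino acid)
    and returns a single vector of size D, regardless of L.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same size")
    length = len(residue_embeddings)
    # Column-wise mean: one average per embedding dimension.
    return [sum(vec[i] for vec in residue_embeddings) / length for i in range(dim)]


# Toy stand-in for model output (assumed values, not real embeddings):
# 3 residues ("S", "E", "Q"), embedding size 4.
toy_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

protein_embedding = mean_pool(toy_embeddings)
print(protein_embedding)  # [2.0, 1.0, 2.0, 2.0]
```

The per-residue rows would feed a token-level predictor (e.g. secondary structure), while the pooled vector has the same size for any sequence length and so can feed a per-protein predictor (e.g. localization).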