Scaling Up ESM2 Architectures for Long Protein Sequences Analysis: Long and Quantized Approaches

Gabriel Bianchin de Oliveira, Helio Pedrini, Zanoni Dias

TL;DR

The paper extends ESM2 architectures to handle proteins up to 2,048 amino acids by introducing long and quantized variants. It adopts a Longformer-inspired local attention with a fixed window $k$ and context copying to 2,050 tokens, while offering int4 quantization and pretraining on a large UniProt subset. Evaluations on CAFA5 protein-function prediction show that long and quantized embeddings often surpass standard ESM2, especially for longer sequences. The work highlights practical pathways to deploy large protein-language models with reduced memory and faster inference, and points to future applications across additional protein-analysis tasks.

Abstract

Various approaches utilizing Transformer architectures have achieved state-of-the-art results in Natural Language Processing (NLP). Based on this success, numerous architectures have been proposed for other types of data, such as biological data, particularly protein sequences. Notable among these are the ESM2 architectures, pre-trained on billions of proteins, which form the basis of various state-of-the-art approaches in the field. However, the ESM2 architectures limit the input size to 1,022 amino acids, which necessitates the use of preprocessing techniques to handle sequences longer than this limit. In this paper, we present the long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.
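
The abstract does not say which preprocessing techniques are meant. As a rough illustration only, two common options for fitting a long sequence into a 1,022-residue limit are truncation and splitting into consecutive chunks. The function names below are hypothetical and are not taken from the paper or from any ESM2 library.

```python
from typing import List

# Hypothetical helpers: illustrative preprocessing, not the authors' method.
ESM2_MAX_RESIDUES = 1022  # input limit stated in the abstract


def truncate(sequence: str, max_len: int = ESM2_MAX_RESIDUES) -> str:
    """Keep only the first max_len residues; the rest is discarded."""
    return sequence[:max_len]


def split_into_chunks(sequence: str, max_len: int = ESM2_MAX_RESIDUES) -> List[str]:
    """Split a sequence into consecutive, non-overlapping chunks of at most max_len residues."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


if __name__ == "__main__":
    protein = "M" * 2048  # placeholder sequence at the extended limit
    print(len(truncate(protein)))
    print([len(chunk) for chunk in split_into_chunks(protein)])
```

Truncation loses everything past the limit, and chunking breaks the context between chunks, which is the kind of cost a longer input window is meant to avoid.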

Paper Structure

This paper contains 4 sections, 3 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Self-attention mechanisms. In the global self-attention mechanism, each amino acid examines all the amino acids in the sequence. In the local self-attention mechanism, each amino acid examines the amino acids within a specific window.
  • Figure 2: Pipeline for evaluating protein embeddings from ESM2 architectures. The method receives the amino acid sequence as input. Then, the features from the last layer of the backbone are used to train a classifier. During the classification step, the best classification model is identified using AutoML.
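
The difference between the two mechanisms in Figure 1 can be sketched as a boolean attention mask. This is a minimal illustration, not the paper's implementation. It assumes the Longformer convention that a window of size `window` covers `window // 2` positions on each side of a token; the function names are hypothetical.

```python
from typing import List

Mask = List[List[bool]]


def global_attention_mask(n: int) -> Mask:
    """Every position may attend to every position in the sequence."""
    return [[True] * n for _ in range(n)]


def local_attention_mask(n: int, window: int) -> Mask:
    """Position i may attend to position j only if |i - j| <= window // 2.

    Assumption: the window is split evenly on both sides of each token.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def count_attended_pairs(mask: Mask) -> int:
    """Number of (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 6, 2
    print(count_attended_pairs(global_attention_mask(n)))
    print(count_attended_pairs(local_attention_mask(n, window)))
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is why local attention is attractive for longer inputs.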