Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu

TL;DR

Prot2Token introduces a unified tokenization scheme that maps a broad spectrum of protein-prediction tasks into a standardized next-token prediction framework. By coupling a pre-trained protein encoder with an autoregressive decoder and learnable task tokens, it enables multi-task learning across classification, regression, binding-site, sequence-to-sequence, and other tasks using a single architecture. Empirical results show competitive performance across benchmarks together with substantial efficiency gains, most notably 3D structure generation up to roughly 1000x faster than AlphaFold2 with MSA on the same hardware. The work also demonstrates the value of self-supervised pre-training for the decoder and highlights opportunities to extend the approach toward higher-fidelity 3D tokenizers and generative protein design. Overall, Prot2Token proposes a scalable, promptable pathway to standardize protein prediction within a generative interface, potentially accelerating discovery and therapeutics while prompting careful consideration of ethical and security implications.

Abstract

The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture facilitates multi-task learning, enabling general-purpose decoders to generalize across five distinct task categories. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's predictive power in different types of protein-prediction tasks. In 3D structure prediction, Prot2Token delivers substantial speedups (up to 1000x faster than AlphaFold2 with MSA on the same hardware) while matching or surpassing specialized methods across numerous other tasks. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a step towards standardizing biological prediction into a generative interface, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.

Paper Structure

This paper contains 55 sections, 8 equations, 24 figures, 34 tables, 1 algorithm.

Figures (24)

  • Figure 1: High-level architecture of Prot2Token highlighting multi-task capability in protein-level, residue-level, and protein-protein level tasks.
  • Figure 2: Detailed architecture of Prot2Token highlighting multi-task capability. This diagram shows the Prot2Token components: a bidirectional protein encoder, an optional chemical encoder, a fusion block, and an autoregressive decoder guided by task token embeddings for various prediction tasks (examples listed). It illustrates the framework's potential for simultaneous multi-task learning; in practice, training in this work covered only combinations of a few tasks because of computational cost, which is enough to demonstrate the principle.
  • Figure 3: Prot2Token converts heterogeneous labels into uniform sequences. Examples illustrate the five tokenization categories: (i) sequence-to-sequence, (ii) classification (multi-class/multi-label), (iii) regression, (iv) binding-site indexing, and (v) other composite outputs such as PTM. Together they highlight the framework's task-agnostic decoding format.
  • Figure 4: Task-token prompting and loss masking in the Prot2Token decoder. (A) Standard decoding starts with a <BOS> token and predicts label tokens, computing loss over all positions. (B) Prompted decoding inserts a task token ($T_1$) before labels; this token is zero-weighted in the loss, guiding the model without affecting training error.
  • Figure 5: Tokenization workflow for protein–protein binding sites. A distance cutoff is applied to a residue–residue distance matrix derived from the PDB complex to flag contacting residues. Rows with at least one contact are collapsed into a sorted list of residue indices, which becomes the target token sequence.
  • ...and 19 more figures
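
The loss masking described in the Figure 4 caption can be sketched with a weighted cross-entropy. This is a minimal NumPy illustration and not the authors' implementation: the function name `masked_next_token_loss`, the trailing end-of-sequence target, and the normalization by the sum of weights are all assumptions made for the example. The only behavior taken from the caption is that the position whose target is the task token receives zero weight.

```python
import numpy as np

def masked_next_token_loss(logits, targets, weights):
    """Weighted mean negative log-likelihood over decoder positions.

    logits:  (T, V) array of unnormalized scores, one row per position.
    targets: (T,) integer array of target token ids.
    weights: (T,) float array; 0.0 excludes a position from the loss.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target token at each position.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Assumption: normalize by the total weight of the unmasked positions.
    return float((nll * weights).sum() / weights.sum())

# Hypothetical prompted sequence: decoder inputs [<BOS>, T1, y1, y2],
# targets [T1, y1, y2, <EOS>]. The task token T1 (id 5 here) is zero-weighted.
targets = np.array([5, 1, 2, 3])
weights = np.array([0.0, 1.0, 1.0, 1.0])
logits = np.random.default_rng(0).normal(size=(4, 6))
loss = masked_next_token_loss(logits, targets, weights)
```

Because the first weight is zero, changing the scores at the task-token position leaves `loss` unchanged, which is the sense in which the token guides decoding without contributing to the training error.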
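
The binding-site tokenization in the Figure 5 caption (threshold a distance matrix, keep rows with at least one contact, emit the sorted residue indices) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `binding_site_tokens`, the 6.0 Å cutoff, and the 1-based indexing are hypothetical choices, since the caption does not give them.

```python
import numpy as np

def binding_site_tokens(dist, cutoff=6.0):
    """Turn a residue-residue distance matrix into a list of residue indices.

    dist:   (n_a, n_b) array; dist[i, j] is the distance between residue i
            of the target chain and residue j of the partner chain.
    cutoff: contact threshold (assumed value; not stated in the caption).
    """
    contacts = dist <= cutoff  # boolean contact map
    # Rows with at least one contact; flatnonzero returns ascending indices.
    rows = np.flatnonzero(contacts.any(axis=1))
    # Assumption: residues are numbered from 1 in the target sequence.
    return [int(i) + 1 for i in rows]

# Toy complex: 4 residues in the target chain, 2 in the partner chain.
dist = np.array([
    [3.0, 12.0],
    [9.5, 8.1],
    [15.0, 5.9],
    [20.0, 30.0],
])
tokens = binding_site_tokens(dist, cutoff=6.0)  # residues 1 and 3 are in contact
```

The resulting index list is what would be written out as the target token sequence for the decoder; a chain with no contacts yields an empty list.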