xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

TL;DR

This work introduces xTrimoPGLM, a unified 100B-parameter protein language model that jointly optimizes autoencoding (MLM) and autoregressive (GLM) objectives to excel at both understanding and generation tasks. Trained on ~1 trillion tokens with a two-stage curriculum, it delivers SOTA or near-SOTA performance across 18 protein benchmarks and enables high-quality 3D structure prediction via a PLM-based folding variant (xT-Fold). The authors further extend the framework to programmable protein generation through supervised fine-tuning (SFT) and reinforcement self-training (ReST), and demonstrate antibody-specific capabilities with xTrimoPGLM-Ab and xTrimoPGLM-AbFold, achieving rapid, MSA-free antibody structure prediction and controllable CDR3 region design. While highlighting scaling benefits, they also discuss practical challenges such as computational cost, generation hallucinations, and out-of-distribution (OOD) generalization, and propose strategies like smooth transitions in training dynamics and potential retrieval-augmented setups. Overall, the work advances the protein foundation model landscape by unifying understanding and design, enabling versatile applications in protein engineering and drug discovery.

Abstract

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

Paper Structure (14 sections, 8 equations, 20 figures, 14 tables)

Figures (20)

  • Figure 1: Comprehensive Insights into xTrimoPGLM. A. The pre-training and fine-tuning stages of xTrimoPGLM, combining BERT-style (blue and purple, for masking and predicting tokens) and GPT-style (green, from [S] to [E], for autoregressive generation) objectives. The prefix's bidirectional attention facilitates protein understanding tasks like structure prediction, while the suffix supports both de novo and conditional protein design through sequence generation. B. xTrimoPGLM shows lower perplexity than other leading PLMs like ESM2 and PROGEN2-xlarge in evaluations on two distinct out-of-distribution datasets, indicating its advanced performance. C. The scaling behavior of xTrimoPGLM-series from 1 million to 1 billion parameters, trained with 100 billion tokens, demonstrating xTrimoPGLM-100B's efficiency through a power law fit of training losses against computational resources.
  • Figure 2: The Performance of Protein Understanding Benchmark. A. For the classification tasks, four metrics are employed (see the Supplementary section on downstream tasks): TopL/5 accuracy (Contact map), accuracy (Fold classification, Secondary structure, Antibiotic resistance, Solubility, Localization, Metal ion binding), AUC (Peptide-HLA/MHC affinity, TCR-pMHC affinity, Clone CLF, Material production), and the Matthews Correlation Coef. (Temperature stability). For the regression tasks, two metrics are used: the Spearman Correlation Coef. (Fluorescence, Fitness, Stability, Optimal temperature, Optimal pH) and the Pearson Correlation Coef. (Enzyme catalytic efficiency). B. The scaling trend between the computational cost of model training, quantified in PF-days (one PF-day $= 8.64 \times 10^{19}$ FLOPs), and model performance. Each data point represents the mean performance metric for a specific task category (Pb for Probing and Ft for Fine-tuning with LoRA). E150M/650M/3B/15B and xT100B denote ESM2-150M/650M/3B/15B and xTrimoPGLM-100B, respectively. C. Correlations between the pre-training validation loss, measured by the MLM objective, and the performance on the downstream tasks. To facilitate comparison, we normalize this performance by subtracting the mean value and dividing by the standard deviation.
  • Figure 3: Structure Prediction with xT-Fold. A. xT-Fold architecture leverages a Multi-Layer Perceptron (MLP) to convert PLM representations into inputs for the folding modules, which generate 3D coordinates and pLDDT confidence scores. B. TM-score benchmarks for structure prediction models. The bar chart shows the performance of single-sequence PLM-based models and MSA-based models on CAMEO and CASP15 datasets. C. Inference time comparison across models for varying sequence length intervals, showing xT-Fold, ESMFold, OmegaFold, AlphaFold, RoseTTAFold, and MSA search times in seconds. D. Scatter plots compare xT-Fold predictions (x-axis) to other models (y-axis), color-coded by perplexity (green for high, purple for low).
  • Figure 4: Diversification of Generated Proteins by xTrimoPGLM. A. Violin plots comparing ESMFold-predicted confidence (pLDDT scores) and similarity to Protein Data Bank (PDB) entries, measured by TM-score and sequence identity, for sequences generated by xTrimoPGLM (green, $N$ = 14,626) and PROGEN2-xlarge (blue, $N$ = 8,466). The plots display the median, upper and lower quartiles, and whiskers representing $1.5\times$ the interquartile range. B. The comprehensive mapping of protein structural space as informed by sequences generated by xTrimoPGLM. Each node represents a sequence generated by xTrimoPGLM or a sequence from SCOPe70_2.08. Two nodes are linked when one of them can be searched from SCOPe70_2.08 with an alignment of at least 20 amino acids and 70% hhsearch probability. The color coding corresponds to distinct SCOP structural classes, with xTrimoPGLM-generated sequences highlighted in white. For illustration (see the Supplementary Figures), we showcase 6 examples from the generated sequences. The PDB chain ID with the highest structural similarity to the generated sequence, their sequence identity, and TM-score are displayed above each example. The color of the structure matches the xT-Fold pLDDT values. The blue color represents high confidence (pLDDT > 90). 4nnz_A: Probable zinc protease. 2iah_A: Ferripyoverdine receptor. 6vby_A: Cinnamic acid 4-hydroxylase. 4v37_B: Betaine aldehyde dehydrogenase. 2nuu_E: Ammonia channel. 5msq_A: Carboxylic acid reductase.
  • Figure 5: Robust Alignment Capabilities of xTrimoPGLM in Protein Sequence Generation towards Desired Properties. Quantitative analysis of xTrimoPGLM, enhanced with Supervised Fine-Tuning (SFT) and Reinforcement Self-Training (ReST), across five selected tasks. The number of sampled generated sequences, $N$(other), and natural sequences, $N$(nature), used in the analysis are illustrated in the figure. The results demonstrate xTrimoPGLM’s effectiveness in aligning with specific task objectives, as shown by the average scores (higher scores indicate better alignment). P.Gen2 refers to the PROGEN2-xlarge model (Nijkamp et al., 2023) with 6.4 billion parameters, and P.GPT2 denotes the ProtGPT2 model (Ferruz et al., 2022) with 740 million parameters.
  • ...and 15 more figures
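
The Figure 2 caption relies on two small pieces of arithmetic: converting training compute into PF-days, and standardizing downstream scores so different tasks can share one axis. The following is a minimal sketch of both, not the authors' code. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import statistics

# One PF-day is one petaFLOP/s (1e15 FLOP/s) sustained for one day,
# i.e. 1e15 * 86,400 s = 8.64e19 FLOPs, matching the Figure 2 caption.
FLOPS_PER_PF_DAY = 1e15 * 24 * 60 * 60


def flops_to_pf_days(total_flops: float) -> float:
    """Convert a total FLOP count into PF-days (hypothetical helper)."""
    return total_flops / FLOPS_PER_PF_DAY


def z_normalize(scores: list[float]) -> list[float]:
    """Standardize scores: subtract the mean, divide by the standard deviation.

    Assumption: population standard deviation (pstdev); the caption does not
    specify population vs. sample.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        raise ValueError("all scores are identical; cannot normalize")
    return [(s - mean) / std for s in scores]


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    print(flops_to_pf_days(8.64e19))
    print(z_normalize([1.0, 2.0, 3.0]))
```

After standardization, each task's scores have zero mean and unit variance, which is what allows metrics on different scales (accuracy, AUC, correlation coefficients) to be compared against the same validation-loss axis.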