xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
TL;DR
This work introduces xTrimoPGLM, a unified 100B-parameter protein language model that jointly optimizes autoencoding (MLM) and autoregressive (GLM) objectives so that a single model handles both understanding and generation tasks. Trained on ~1 trillion tokens with a two-stage curriculum, it delivers state-of-the-art or near-state-of-the-art performance across 18 protein understanding benchmarks and enables high-quality 3D structure prediction via a PLM-based folding variant (xT-Fold). The authors further extend the framework to programmable protein generation through supervised fine-tuning (SFT) and reinforced self-training (ReST), and demonstrate antibody-specific capabilities with xTrimoPGLM-Ab and xTrimoPGLM-AbFold, achieving rapid, MSA-free antibody structure prediction and controllable CDR3 region design. While highlighting the benefits of scaling, they also discuss practical challenges such as computational cost, hallucinations in generated sequences, and out-of-distribution (OOD) generalization, and propose strategies such as smooth transitions in training dynamics and potential retrieval-augmented setups. Overall, the work advances the protein foundation model landscape by unifying understanding and design, enabling versatile applications in protein engineering and drug discovery.
Abstract
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal two findings. 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM can not only generate de novo protein sequences following the principles of natural ones, but also perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
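The two objective families named in the abstract differ mainly in how a training sequence is corrupted and what the model is asked to predict. The following is a minimal, illustrative Python sketch of that difference, not the authors' implementation: the special-token names (`[MASK]`, `[sMASK]`, `<sop>`, `<eop>`), the function names, and the fixed mask positions are all assumptions chosen for readability.

```python
from typing import Dict, List, Tuple

# Hypothetical special tokens, chosen for illustration only.
MASK = "[MASK]"    # stands in for one masked residue (autoencoding / MLM style)
SPAN = "[sMASK]"   # stands in for a whole masked span (autoregressive / GLM style)
SOP = "<sop>"      # start-of-piece marker for the span to be generated
EOP = "<eop>"      # end-of-piece marker closing the generated span


def mlm_example(seq: str, positions: List[int]) -> Tuple[List[str], Dict[int, str]]:
    """Autoencoding-style corruption: mask single residues in place.

    Returns the corrupted tokens and a map from each masked position to the
    residue the model should recover there.
    """
    tokens = list(seq)
    targets: Dict[int, str] = {}
    for p in positions:
        targets[p] = tokens[p]
        tokens[p] = MASK
    return tokens, targets


def glm_example(seq: str, span: Tuple[int, int]) -> Tuple[List[str], List[str]]:
    """Autoregressive blank-filling: replace seq[start:end] with one span token.

    The model input is the corrupted context followed by the shifted span;
    the target is the span itself, predicted left to right, then EOP.
    """
    start, end = span
    context = list(seq[:start]) + [SPAN] + list(seq[end:])
    span_tokens = list(seq[start:end])
    model_input = context + [SOP] + span_tokens
    target = span_tokens + [EOP]
    return model_input, target


if __name__ == "__main__":
    toy_seq = "MKTAYIAK"  # toy amino-acid string, not real data
    print(mlm_example(toy_seq, [1, 4]))
    print(glm_example(toy_seq, (2, 5)))
```

In a joint setup, examples of both kinds would be fed to the same Transformer and their losses combined; how xTrimoPGLM mixes and schedules them is not specified in this section, so the sketch stops at data formatting.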
