ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models

Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao

TL;DR

ProtTeX introduces a unified tokenization framework that merges protein sequences, 3D structures, and natural language into a single discrete space for decoder-only LLMs, enabling structure-informed in-context reasoning via Next-Token Prediction. By combining vector-quantized structure tokens, SE(3)-invariant encoders/decoders, and interleaved multimodal prompts, ProtTeX achieves state-of-the-art performance in protein function understanding, structure generation, and design across the PFUD, PSAD, PSPD, and PDD datasets, outperforming domain-specific baselines. The work further demonstrates Chain-of-Thought reasoning across modalities, which improves exact-match metrics and supports multimodal structure–function reasoning with sampling strategies such as Beam Search with Lowest Perplexity and nucleus sampling. Controllable protein design is showcased through design prompts that yield self-consistent folding and preserved active-site features, suggesting practical utility in accelerated protein engineering. Looking ahead, scaling to larger LLMs, applying reinforcement-learning-based alignment, and refining inference-time self-improvement could further amplify ProtTeX’s capabilities for multimodal biomolecular reasoning.
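The "Beam Search with Lowest Perplexity" strategy named above can be illustrated with a minimal sketch: given several candidate generations and their per-token log-probabilities, keep the one with the lowest perplexity. The function names and the candidate format (a label paired with a list of natural-log token probabilities) are assumptions made for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(
    candidates: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, Sequence[float]]:
    """Return the (label, logprobs) candidate with the lowest perplexity.

    `candidates` is a hypothetical container for beam outputs; a real
    decoder would supply the per-token log-probabilities.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    return min(candidates, key=lambda cand: perplexity(cand[1]))


# Toy beams: the model is more confident about beam_a than beam_b.
beams = [
    ("beam_a", [-0.1, -0.2, -0.1]),
    ("beam_b", [-1.0, -1.0, -1.0]),
]
best_label, _ = pick_lowest_perplexity(beams)
```

Dividing by the sequence length keeps the comparison fair when candidates have different numbers of tokens.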

Abstract

Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProtTeX, which tokenizes protein sequences, structures, and textual information into a unified discrete space. This approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.
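To make the idea of a unified discrete space concrete, here is a toy sketch in which text tokens, amino-acid tokens, and structure-codebook tokens share one vocabulary with disjoint IDs, and per-residue structure embeddings are mapped to their nearest codebook entry. Everything here is an assumption for illustration: the token names (`<aa_X>`, `<struct_i>`), the sequence-then-structure ordering, and the 2-D embeddings. The SE(3)-invariant encoder described in the paper is not reproduced.

```python
from typing import Dict, List, Sequence


def build_unified_vocab(
    text_tokens: Sequence[str], residues: Sequence[str], n_structure_codes: int
) -> Dict[str, int]:
    """Assign one shared ID space to text, residue, and structure tokens."""
    tokens = list(text_tokens)
    tokens += [f"<aa_{r}>" for r in residues]
    tokens += [f"<struct_{i}>" for i in range(n_structure_codes)]
    vocab: Dict[str, int] = {}
    for tok in tokens:
        if tok in vocab:
            raise ValueError(f"duplicate token: {tok}")
        vocab[tok] = len(vocab)
    return vocab


def quantize(embedding: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Vector quantization: index of the nearest codebook entry (squared L2)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((e - c) ** 2 for e, c in zip(embedding, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode_protein(
    sequence: str,
    residue_embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    vocab: Dict[str, int],
) -> List[int]:
    """Emit residue token IDs followed by structure token IDs (assumed layout)."""
    ids = [vocab[f"<aa_{r}>"] for r in sequence]
    ids += [vocab[f"<struct_{quantize(e, codebook)}>"] for e in residue_embeddings]
    return ids


# Toy example: two residues, a two-entry codebook, hand-made embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
vocab = build_unified_vocab(["What", "function"], "AG", n_structure_codes=2)
token_ids = encode_protein("AG", [[0.1, 0.0], [0.9, 1.0]], codebook, vocab)
```

Because every modality ends up as integer IDs in one vocabulary, a decoder-only model can be trained on the concatenated stream with ordinary next-token prediction.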

Paper Structure

This paper contains 21 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: (A) Overview of the model architecture. (B) Structure-In-Context schematic diagram. The model can use protein structures as input, as output, or as a CoT intermediate. (C) Prompt templates for the different datasets.
  • Figure 2: Heatmap illustrating the Exact Match Jaccard Index (EMJI) of various models across different protein understanding tasks in the PFUD test set, including Molecular Function (n=1,127), Subcellular Location (n=2,071), Biological Process (n=459), Domains or Motifs (n=886), and Multi-Attribute (n=974). The best-performing metric for each task is highlighted in bold.
  • Figure 3: Multimodal chain-of-thought with multi-round chat. (A) Direct Prompting, which asks directly about protein structure or protein function. (B) Chain-of-Thought Prompting, which first analyzes the sequence, then generates the structure, and subsequently infers the function step by step. The Llama icon is sourced from https://github.com/alexrozanski/LlamaChat.
  • Figure 4: Multimodal Chain-of-Thought Reasoning performance. (A) Bar plot comparing the performance scores of subcellular location prediction between Direct Prompting and CoT Prompting (n=1978). (B) Scatter plot illustrating the negative correlation between perplexity and TM-score of predicted structures. The Pearson correlation coefficient and corresponding p-value are provided in the legend. (C) & (D) Comparison of structure prediction performance between the Beam Search with Lowest Perplexity and Greedy Search strategies on the PSPD test set (n=500). (E) & (F) Comparison of structure prediction performance between Direct Prompting and CoT Prompting on the PSAD test set (n=500).
  • Figure 5: Multi-conformation sampling for fold-switching proteins: (A) KaiB, (B) MAD2, and (C) RfaH.
  • ...and 4 more figures