Protein as a Second Language for LLMs

Xinhui Chen, Zuchao Li, Mengqi Gao, Yufeng Zhang, Chak Tou Leong, Haoyang Li, Jiaqi Chen

TL;DR

This work reframes protein sequences as a second language that large language models can learn through in-context learning, without task-specific training. It introduces a bilingual protein–language QA dataset (79,926 triples) and an adaptive context construction framework that selects exemplars by sequence homology and textual similarity to ground protein functions. Across multiple open-source LLMs and GPT-4o, the approach yields consistent ROUGE-L improvements (average ~7%, up to 17.2%) and can even surpass fine-tuned protein-specific models, highlighting the potential of foundation models for scalable protein understanding. The dataset, methodology, and analyses offer a practical way to use generic LLMs for protein interpretation. The work is supported by human evaluations and reproducibility commitments, and it addresses ethical considerations in the use of biomedical knowledge.

Abstract

Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
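As a rough illustration of the exemplar-selection idea described above, the sketch below ranks candidate sequence-question-answer triples by a blend of sequence similarity and question similarity, then assembles a few-shot prompt. It is not the authors' implementation: the k-mer Jaccard score is a crude stand-in for a real homology search, the token-overlap score is a stand-in for a proper text-similarity measure, and the names `Triple`, `select_exemplars`, `build_prompt`, and the weight `alpha` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One sequence-question-answer exemplar (hypothetical container)."""
    sequence: str  # amino-acid sequence, one letter per residue
    question: str
    answer: str


def kmers(seq: str, k: int = 3) -> set:
    """Overlapping k-mers of a sequence (assumed proxy for homology)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def score(query_seq: str, query_q: str, ex: Triple, alpha: float = 0.5) -> float:
    """Blend sequence similarity and question similarity (alpha is assumed)."""
    seq_sim = jaccard(kmers(query_seq), kmers(ex.sequence))
    txt_sim = jaccard(set(query_q.lower().split()),
                      set(ex.question.lower().split()))
    return alpha * seq_sim + (1 - alpha) * txt_sim


def select_exemplars(query_seq, query_q, pool, k=4, alpha=0.5):
    """Return the k highest-scoring triples from the pool."""
    ranked = sorted(pool,
                    key=lambda ex: score(query_seq, query_q, ex, alpha),
                    reverse=True)
    return ranked[:k]


def build_prompt(query_seq, query_q, exemplars):
    """Lay out exemplars followed by the unanswered query."""
    blocks = [
        f"Sequence: {ex.sequence}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in exemplars
    ]
    blocks.append(f"Sequence: {query_seq}\nQuestion: {query_q}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Toy pool; sequences and answers are made up for illustration only.
    pool = [
        Triple("MKTAYIAKQR", "What is the function of this protein?", "Kinase."),
        Triple("GGGGGGGGGG", "Where is this protein located?", "Membrane."),
        Triple("MKTAYIAKQL", "What is the function of this protein?", "Kinase-like."),
    ]
    query_seq, query_q = "MKTAYIAKQR", "What is the function of this protein?"
    chosen = select_exemplars(query_seq, query_q, pool, k=2)
    print(build_prompt(query_seq, query_q, chosen))
```

The resulting prompt is sent to an unmodified LLM, so no model weights are updated. In practice the two similarity functions would likely be replaced by a dedicated sequence-search tool and an embedding-based text comparison.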

Paper Structure

This paper contains 32 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The overview of data construction of our bilingual protein–QA dataset.
  • Figure 2: Process of Query-Adaptive Context Construction.
  • Figure 3: Dataset statistics. Left: Multidimensional analysis of protein amino-acid sequences, including length, domain composition, and catalytic activity. Right: Sample sizes for the four protein-QA types and the ratio of textual to amino-acid sequence tokens.
  • Figure 4: Comparison of human evaluation results. Left: Absolute human rating scores (0–5) for zero-shot model outputs (dark bars) and model outputs with adaptive context exposure (light bars) on three datasets. Right: Pairwise win/lose proportions comparing outputs with and without adaptive context exposure. Each comparison is based on 8 randomly selected cases per subset (48 cases in total across six subsets).
  • Figure 5: Effect of varying the number of exemplars ($k$) on model performance. The search space was $k\in[1,12]$; the upper bound was set after a coarse scan up to $k=50$ showed performance saturating at roughly 2–12 exemplars. Metric: ROUGE-L.
  • ...and 4 more figures
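
For readers unfamiliar with the ROUGE-L metric named in Figure 5, the sketch below computes the standard sentence-level form: an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. The whitespace tokenization, lowercasing, and `beta = 1` default are assumptions made here; the source does not state which ROUGE-L variant or settings the paper uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-measure (assumed whitespace tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    reference = "the protein binds atp"
    print(rouge_l("the protein binds atp", reference))
    print(rouge_l("binds atp", reference))
```

A candidate that reproduces only part of the reference in order scores high on precision but lower on recall, which is why the metric rewards answers that cover more of the reference description.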