Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li

TL;DR

This work examines text-based protein understanding and shows that existing benchmarks suffer from pervasive data leakage and that retrieval methods often outperform fine-tuned LLMs. It introduces RAPM, a Retrieval-Augmented Protein Modeling framework that builds a dual-index protein knowledge database and fuses retrieved evidence with LLM reasoning through a RAG-like prompt. For more robust evaluation, it proposes Pro-Inst-OOD, a unified, leakage-minimized benchmark, and Entity-BLEU, a biology-focused metric that emphasizes entities over generic text. Empirical results show that RAPM performs strongly on open-ended biological QA while being more training-efficient, which highlights the value of combining retrieval with generation for protein understanding and the need for rigorous, domain-aware evaluation metrics. Retrieval is formalized with a score for the $i$-th database entry, $s_i = \alpha \cdot \mathrm{sim}_{\mathrm{seq}}(s, s_i) + (1 - \alpha) \cdot \mathrm{sim}_{\mathrm{emb}}(e, e_i)$, where $\alpha$ weights sequence similarity against embedding similarity, and the design is intended to make knowledge integration for LLM-based protein reasoning efficient and scalable.
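The retrieval score above can be illustrated with a minimal Python sketch. It is not the RAPM implementation. The names `Entry`, `sim_seq`, `sim_emb`, `retrieval_score`, and `top_k` are hypothetical, the default `alpha=0.5` is an arbitrary choice, and `difflib.SequenceMatcher` is only a stand-in for whatever sequence-similarity measure the paper uses. Embeddings are plain lists of floats so that the block stays self-contained.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import sqrt


@dataclass
class Entry:
    """One hypothetical database record: sequence, embedding, annotation."""
    sequence: str
    embedding: list[float]
    annotation: str


def sim_seq(a: str, b: str) -> float:
    # Assumption: difflib's ratio (in [0, 1]) stands in for an
    # alignment-based sequence similarity.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def sim_emb(u: list[float], v: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieval_score(query_seq, query_emb, entry, alpha=0.5):
    # alpha * sim_seq + (1 - alpha) * sim_emb, as in the formula above.
    return (alpha * sim_seq(query_seq, entry.sequence)
            + (1 - alpha) * sim_emb(query_emb, entry.embedding))


def top_k(query_seq, query_emb, database, k=2, alpha=0.5):
    # Score every entry and return the k highest-scoring (score, entry) pairs.
    scored = [(retrieval_score(query_seq, query_emb, e, alpha), e)
              for e in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    database = [
        Entry("MKTAYIAKQR", [1.0, 0.0], "kinase"),
        Entry("GGGGGGGGGG", [0.0, 1.0], "structural"),
        Entry("MKTAYIAKQL", [0.9, 0.1], "kinase-like"),
    ]
    for score, entry in top_k("MKTAYIAKQR", [1.0, 0.0], database):
        print(f"{score:.3f}  {entry.annotation}")
```

Setting `alpha=1.0` reduces the score to pure sequence similarity and `alpha=0.0` to pure embedding similarity, which makes the trade-off between the two indices easy to inspect on toy data.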

Abstract

In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess model performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by these observations, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data are available at https://github.com/IDEA-XL/RAPM.

Paper Structure

This paper contains 63 sections, 5 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: a) Protein understanding tasks, and LLM-based and retrieval-based methods for this task. b) The performance of existing methods in protein understanding tasks. Retrieval methods based on protein embeddings or sequences outperform LLM-based approaches.
  • Figure 2: a) Three typical LLM-based approaches for text-based protein understanding. b) Simple nearest-neighbor based retrieval with protein embedding or sequence similarities.
  • Figure 3: We evaluate the degree of data leakage in both existing benchmarks and OOD benchmarks. "Leakage" is defined as the probability that test set samples can directly retrieve similar samples with the same label from the training set.
  • Figure 4: The ROUGE-L score distributions of retrieval-based methods versus LLM-based methods for all test samples in the General Function task.
  • Figure 5: We collect protein-annotation pairs from existing protein annotation databases for the Protein-Knowledge Database construction. We extract dense features of proteins using a Protein Encoder and build database indices using two indexing methods. For entries sharing identical labels, we incorporate meta-features into the database. For downstream queries, we combine scores from both indices to retrieve the Top-K relevant entities, then construct retrieval-augmented prompts after quantizing sequence similarity into High, Mid, and Low confidence levels.
  • ...and 2 more figures
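
Two of the captions above describe procedures that a short sketch can make concrete: the leakage definition in Figure 3, and the quantization of sequence similarity into High, Mid, and Low confidence levels before prompt construction in Figure 5. The Python below is only an illustration under stated assumptions. The function names, the 0.8 and 0.5 thresholds, the prompt wording, and the use of `difflib.SequenceMatcher` as a similarity measure are all hypothetical and are not taken from the paper. Leakage is read here as "the single most similar training sample is similar enough and has the same label", which is one possible interpretation of the caption.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Assumption: difflib's ratio (in [0, 1]) stands in for the paper's
    # sequence or embedding similarity.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def leakage_rate(train, test, threshold=0.8):
    """Fraction of test samples whose most similar training sample is at
    least `threshold` similar and carries the same label.

    `train` and `test` are lists of (sequence, label) pairs.
    """
    if not test:
        return 0.0
    leaked = 0
    for seq, label in test:
        best_seq, best_label = max(
            train, key=lambda item: similarity(seq, item[0])
        )
        if similarity(seq, best_seq) >= threshold and best_label == label:
            leaked += 1
    return leaked / len(test)


def confidence_level(seq_similarity, high=0.8, mid=0.5):
    # Hypothetical cut-offs for the High / Mid / Low buckets.
    if seq_similarity >= high:
        return "High"
    if seq_similarity >= mid:
        return "Mid"
    return "Low"


def build_prompt(question, retrieved):
    """Build a retrieval-augmented prompt.

    `retrieved` is a list of (annotation, sequence_similarity) pairs.
    """
    lines = ["Retrieved annotations of similar proteins:"]
    for annotation, sim in retrieved:
        lines.append(f"- [{confidence_level(sim)} confidence] {annotation}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    train = [("MKTAYIAKQR", "kinase"), ("GGGGGGGGGG", "structural")]
    test = [("MKTAYIAKQR", "kinase"), ("PPPPPPPPPP", "transporter")]
    print(leakage_rate(train, test))
    print(build_prompt("What is the function of this protein?",
                       [("ATP binding", 0.92), ("membrane protein", 0.31)]))
```

Passing the confidence level to the LLM as text, rather than the raw similarity value, is one simple way to let the model weigh retrieved evidence without having to interpret a numeric score.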