A PLMs based protein retrieval framework

Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng

TL;DR

The paper tackles protein retrieval beyond strict sequence similarity by using protein language models (PLMs) to generate embeddings and pairing them with an accelerated dense-vector index for scalable search. It proposes a two-module framework: (1) vectorization of protein sequences using PLMs, and (2) a retrieval engine built on VPTree, FAISS, and LSH for fast nearest-neighbor search. A dedicated benchmarking pipeline, using UniProt SwissProt with EC annotations and AlphaFold2-predicted structures when needed, evaluates multiple PLMs (e.g., esm1b, esm2, prot_t5_xl_uniref50, ProtGPT2) against traditional tools such as BLAST and Foldseek. Results show that PLM-based retrieval can identify functionally related proteins at low sequence identity, improving EC-hit rates and retrieval stability, which makes it practically useful for protein mining.

Abstract

Protein retrieval, which aims to untangle the relationships among sequences, structures, and functions, drives progress in biology. The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has demonstrated the effectiveness of this approach. However, existing protein retrieval tools prioritize sequence similarity and are likely to overlook proteins that are dissimilar in sequence but share homology or function. To address this problem, we propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework pioneers the use of protein language models (PLMs) to embed protein sequences in a high-dimensional feature space, thereby enhancing the representation capacity for subsequent analysis. An accelerated, indexed vector database is then constructed to enable fast access to and retrieval of the dense vectors. Extensive experiments demonstrate that our framework retrieves similar and dissimilar proteins equally well. Moreover, this approach identifies proteins that conventional methods fail to uncover. This framework can assist protein mining and support further advances in biology.
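
The two modules described above (embedding sequences, then searching the embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the hand-written three-dimensional vectors stand in for real PLM embeddings, the exhaustive scan stands in for the accelerated VPTree/FAISS/LSH index, cosine similarity is an assumed metric, and the protein IDs and the `BruteForceIndex` class are hypothetical names used only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class BruteForceIndex:
    """Exhaustive cosine-similarity search (a stand-in for an accelerated index)."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[List[float]] = []

    def add(self, protein_id: str, embedding: Sequence[float]) -> None:
        # Store the normalized embedding next to its identifier.
        self._ids.append(protein_id)
        self._vectors.append(l2_normalize(embedding))

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        # Score every stored vector against the query and keep the k best.
        q = l2_normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), pid)
            for pid, vec in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(pid, score) for score, pid in scored[:k]]


# Toy embeddings (assumption: a real system would obtain these from a PLM).
index = BruteForceIndex()
index.add("P_A", [1.0, 0.0, 0.0])
index.add("P_B", [0.9, 0.1, 0.0])
index.add("P_C", [0.0, 1.0, 0.0])

hits = index.search([1.0, 0.05, 0.0], k=2)
print([pid for pid, _ in hits])
```

An exhaustive scan costs time linear in the database size per query, which is why a framework of this kind would replace it with an approximate or tree-based index once the database grows.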

Paper Structure

This paper contains 18 sections, 5 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: BLAST algorithm diagram
  • Figure 2: A PLMs based Protein Retrieval Framework
  • Figure 3: Embedding of protein sequence
  • Figure 4: Multi-layer acceleration structure
  • Figure 5: Protein retrieval multi-level matching diagram: In the diagram, protein $\alpha$ refers to A0A1D6EFT8 with an EC number of 4.2.3.197, while protein $\beta$ refers to B2C4D0 with an EC number of 4.2.3.57. The degree of matching between protein $\alpha$ and protein $\beta$ is assessed at four levels. Firstly, a level 4 evaluation is performed, and the intersection between 4.2.3.197 and 4.2.3.57 is empty, indicating no level 4 match. Next, a level 3 evaluation is conducted, and the intersection between 4.2.3 and 4.2.3 is not empty, indicating a level 3 match between protein $\alpha$ and protein $\beta$. Similarly, protein $\alpha$ and protein $\beta$ also pass the level 2 and level 1 evaluations. Based on the above information, we conclude that the degree of matching between protein A0A1D6EFT8 and protein B2C4D0 is at level 3.
  • ...and 7 more figures
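
The multi-level matching described in the Figure 5 caption can be sketched in a few lines. The caption states that levels are checked from 4 down to 1 and that a level matches when the intersection of the truncated EC numbers is non-empty. Everything else here is an assumption made for illustration: the function names, the use of sets to allow several EC numbers per protein, the return value 0 for "no match", and the treatment of unspecified fields written as `-`.

```python
from typing import Iterable, Set


def ec_prefixes(ec_numbers: Iterable[str], level: int) -> Set[str]:
    """Truncate each EC number to its first `level` fields."""
    prefixes: Set[str] = set()
    for ec in ec_numbers:
        head = ec.strip().split(".")[:level]
        # Assumption: an annotation such as "4.2.-.-" carries no information
        # at the levels where it shows "-", so it is skipped at those levels.
        if len(head) == level and "-" not in head:
            prefixes.add(".".join(head))
    return prefixes


def match_level(ecs_a: Iterable[str], ecs_b: Iterable[str]) -> int:
    """Return the deepest level (4 down to 1) at which the EC sets intersect, else 0."""
    ecs_a, ecs_b = list(ecs_a), list(ecs_b)
    for level in (4, 3, 2, 1):
        if ec_prefixes(ecs_a, level) & ec_prefixes(ecs_b, level):
            return level
    return 0


# The example from the caption: A0A1D6EFT8 (4.2.3.197) versus B2C4D0 (4.2.3.57).
print(match_level({"4.2.3.197"}, {"4.2.3.57"}))  # 3
```

Checking from the deepest level first means the function returns as soon as the most specific agreement is found, which mirrors the order of evaluation in the caption.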