A PLMs based protein retrieval framework
Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng
TL;DR
The paper tackles protein retrieval beyond strict sequence similarity by leveraging protein language models to generate rich embeddings and pairing them with an accelerated dense-vector index for scalable search. It proposes a two-module framework: (1) vectorization of protein sequences using PLMs, and (2) a retrieval engine built on VPTree, FAISS, and LSH to enable fast nearest-neighbor searches. A dedicated benchmarking pipeline using UniProt SwissProt with EC annotations—and AlphaFold2-predicted structures when needed—evaluates multiple PLMs (e.g., esm1b, esm2, prot_t5_xl_uniref50, ProtGPT2) against traditional tools like BLAST and Foldseek. Results show that PLM-based retrieval can identify functionally related proteins at low sequence identity, improving EC-hit rates and retrieval stability, thereby offering practical benefits for protein mining and biology.
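The two modules described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a PLM encoder (it uses amino-acid composition rather than learned embeddings), `FlatIndex` is a hypothetical exhaustive index standing in for the accelerated retrieval engine, and the protein IDs and sequences are invented for illustration.

```python
import math
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_embed(sequence: str) -> List[float]:
    """Hypothetical stand-in for a PLM encoder (assumption, not from the paper).

    Returns an L2-normalised amino-acid composition vector. A real pipeline
    would instead pool the per-residue hidden states of a PLM.
    """
    counts = [0.0] * len(AMINO_ACIDS)
    for residue in sequence.upper():
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # skip non-standard residue codes
            counts[idx] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class FlatIndex:
    """Exhaustive cosine-similarity search over unit-length vectors."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[List[float]] = []

    def add(self, protein_id: str, vector: List[float]) -> None:
        self._ids.append(protein_id)
        self._vectors.append(vector)

    def search(self, query: List[float], k: int) -> List[Tuple[str, float]]:
        # For unit vectors the dot product equals cosine similarity.
        scored = [
            (pid, sum(q * v for q, v in zip(query, vec)))
            for pid, vec in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


# Module 1: vectorise the (invented) database sequences.
database = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "GGGGSGGGGSGGGGS",
    "protC": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
}
index = FlatIndex()
for pid, seq in database.items():
    index.add(pid, toy_embed(seq))

# Module 2: retrieve the nearest neighbours of a query sequence.
hits = index.search(toy_embed("MKTAYIAKQRQISFVKSHFSRQ"), k=2)
```

Exhaustive search costs time linear in the database size per query, which is why a framework of this kind would swap `FlatIndex` for an approximate or tree-based index once the database grows.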
Abstract
Protein retrieval, which aims to untangle the relationships among protein sequences, structures, and functions, underpins progress in biology. The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has proven effective in this field. However, existing protein retrieval tools prioritize sequence similarity and may therefore overlook proteins that are dissimilar in sequence but share homology or function. To address this problem, we propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework harnesses protein language models (PLMs) to embed protein sequences in a high-dimensional feature space, thereby enhancing their representation for subsequent analysis. An accelerated, indexed vector database is then constructed to enable fast access to and retrieval of the resulting dense vectors. Extensive experiments demonstrate that our framework retrieves similar and dissimilar proteins equally well. Moreover, it identifies proteins that conventional methods fail to uncover. We expect this framework to assist protein mining and to support further advances in biology.
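To make the idea of an accelerated index over dense vectors concrete, the sketch below shows random-hyperplane locality-sensitive hashing, one common LSH family for cosine similarity. The text here does not say which LSH scheme the framework uses, so this choice is an assumption, and the function names, dimensions, and the example vector are hypothetical.

```python
import random
from typing import List


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector: List[float], hyperplanes: List[List[float]]) -> int:
    """Pack one sign bit per hyperplane into an integer bucket key."""
    key = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        key = (key << 1) | (1 if dot >= 0.0 else 0)
    return key


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, n_bits=16)
v = [0.2, -1.0, 0.5, 0.3]  # invented example embedding

sig_v = lsh_signature(v, planes)
sig_scaled = lsh_signature([2.0 * x for x in v], planes)  # same direction
sig_flipped = lsh_signature([-x for x in v], planes)      # opposite direction
```

Vectors that point in similar directions agree on most sign bits, so a query only needs to be compared exactly against database vectors whose signatures are identical or within a small Hamming distance, rather than against the whole database.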
