PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models

Eli M Carrami, Sahand Sharifzadeh

TL;DR

This work proposes zero-shot Protein Question Answering (PQA), a task designed to answer a wide range of protein-related queries without task-specific training, and introduces the Pika framework, which comprises a curated, debiased dataset tailored for PQA and a biochemically relevant benchmarking strategy.

Abstract

Understanding protein structure and function is crucial in biology. However, current computational methods are often task-specific and resource-intensive. To address this, we propose zero-shot Protein Question Answering (PQA), a task designed to answer a wide range of protein-related queries without task-specific training. The success of PQA hinges on high-quality datasets and robust evaluation strategies, both of which are lacking in current research. Existing datasets suffer from biases, noise, and a lack of evolutionary context, while current evaluation methods fail to accurately assess model performance. We introduce the Pika framework to overcome these limitations. Pika comprises a curated, debiased dataset tailored for PQA and a biochemically relevant benchmarking strategy. We also propose multimodal large language models as a strong baseline for PQA, leveraging their natural language processing capabilities and knowledge. This approach promises a more flexible and efficient way to explore protein properties, advancing protein research. Our comprehensive PQA framework, Pika, including dataset, code, and model checkpoints, is openly accessible at github.com/EMCarrami/Pika, promoting wider research in the field.

Paper Structure (47 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Schematic of the Pika framework. Pika-DS is created from filtered SwissProt entries followed by processing using GPT-3.5.
  • Figure 2: Characteristics of PQA dataset. (a) Distribution of token counts for all examples in Pika-DS. (b) Frequency of words in each position in each section of the dataset. Long words are abbreviated (do.=does, mol.=molecule, fun.=function, we.=weight, bel.=belong).
  • Figure 3: Schematic representation of Cross- and Self-Pika architectures for the scientific PQA task. PLM = Protein Language Model (protein sequence encoder), LLM = Large Language Model. Only the adapter and cross-attention modules (both in green) are trained.
  • Figure 4: Evaluating the effectiveness of Biochem-Lite vs traditional linguistic metrics for scientific accuracy of PQA. Statistical significance is determined through a one-tailed paired t-test across three randomly seeded data subsets and model training (significance guide: $^{*}p < 0.05$, $^{**}p < 0.01$, $^{***}p < 0.001$).
  • Figure B.1: Over-representation bias in SwissProt database.
  • ...and 2 more figures
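
The Figure 4 caption reports significance from a one-tailed paired t-test across three seeded runs. As a minimal sketch of that kind of test (not the authors' code), the snippet below handles exactly three paired scores, where the t-distribution has 2 degrees of freedom and its CDF has a simple closed form. The function name and the example scores are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_one_tailed_n3(a, b):
    """One-tailed paired t-test (H1: mean(a - b) > 0) for exactly three pairs.

    With three pairs there are 2 degrees of freedom, for which Student's t
    CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)). The one-tailed p-value is
    therefore 1 - F(t) = 1/2 - t / (2 * sqrt(t^2 + 2)).
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only handles exactly three paired values")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(3))
    p = 0.5 - t / (2 * math.sqrt(t * t + 2))
    return t, p


# Hypothetical per-seed scores for two evaluation setups (illustrative only).
scores_a = [0.62, 0.65, 0.61]
scores_b = [0.50, 0.52, 0.51]
t_stat, p_value = paired_t_one_tailed_n3(scores_a, scores_b)
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```

With only three seeds the test has little power, so a consistent direction of the per-seed differences matters as much as their size. For larger samples, a general-purpose routine such as SciPy's `ttest_rel` with `alternative="greater"` would be the more usual choice.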