Learning the rules of peptide self-assembly through data mining with large language models

Zhenze Yang, Sarah K. Yorke, Tuomas P. J. Knowles, Markus J. Buehler

TL;DR

This work curates a peptide assembly database through a combination of manual processing by human experts and large language model–assisted literature mining, and fine-tunes a GPT model for peptide literature mining with the developed dataset.

Abstract

Peptides are ubiquitous and important biologically derived molecules that have been found to self-assemble into a wide array of structures. Extensive research has explored the impacts of both internal chemical composition and external environmental stimuli on the self-assembly behavior of these systems. However, there is yet to be a systematic study that gathers this rich literature data and collectively examines these experimental factors to provide a global picture of the fundamental rules that govern protein self-assembly behavior. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining facilitated by a large language model (LLM). As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self-assembly phases. Using the collected data, we train and evaluate machine learning (ML) models, which demonstrate excellent accuracy (>80%) and efficiency in peptide assembly phase classification. Moreover, we fine-tune a GPT model for peptide literature mining with the developed dataset; the fine-tuned model markedly outperforms the pre-trained model in extracting information from academic publications. We find that this workflow can substantially improve efficiency when exploring potential self-assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly. In doing so, novel structures can be accessed for a range of applications including sensing, catalysis and biomaterials.

Paper Structure

This paper contains 16 sections and 4 figures.

Figures (4)

  • Figure 1: Overview of the workflow reported in this work. We first collect PDF files of publications from different publishers and scientific databases, based on the existing polypeptide database SAPdb [MATHUR2021104391]. We extract not only the peptide sequence but also the experimental conditions from those publications and learn their impact on the self-assembly phase of polypeptides. The selected publications are read and processed by human experts to curate the database, which is then used to train ML algorithms that predict the self-assembled structure from peptide sequences and experimental conditions. We also use the manually curated database to fine-tune an LLM to specialize in literature mining of polypeptide publications, and compare its performance with the same LLM without fine-tuning. The model can be used to extract information from new publications, significantly reducing the time required compared with manual processing by human experts. Moreover, by incorporating this new data, we can augment our existing database, thereby further refining and enhancing our ML model for phase prediction.
  • Figure 2: Dataset statistics. (a) Histogram of 9 categorical features: "peptide sequence", "N-terminal modification", "C-terminal modification", "Non-terminal modification", "category: peptide/conjugate/mixture", "conjugate partner", "thermal process: heating/cooling", "linear/cyclic" and "solution" (the solution environment of the peptide). (b) Histogram of 4 numerical features: "solvent ratio" (the ratio of solvent in the solution), "concentration" (concentration of peptides), "pH" (pH of solution environment) and "temperature" (ambient temperature of experiments). All values are normalized between 0 and 1 based on min and max values. (c) Histogram of assembled phases. (d) Occurrence of dipeptide data from academic literature. (e) Occurrence of tripeptide data from academic literature.
  • Figure 3: ML algorithms for phase prediction. (a) Performance comparison of 4 classical ML classifiers: random forest (RF), multilayer perceptron (MLP), Gaussian process classifier (GPC) and k-neighbors classifier (KNC). $F_1$, precision and recall scores are used as evaluation metrics. (b) Comparison of the models' performance on the imbalanced dataset (comparison between the "original" and "oversampling" cases) and evaluation of generalization capacity (comparison between the "original" and "generalization" cases). (c) Confusion matrix of the RF model for 8 different phases. (d) SHAP plot [NIPS2017_7062] for feature importance analysis.
  • Figure 4: LLM-assisted literature mining. (a) Overall workflow of LLM-assisted literature mining: we first extract the text within the experimental section of each publication by searching for section headings such as "Material(s)", "Method(s)" and "Experimental section(s)". Afterwards, relevant paragraphs are collected based on keywords related to the target information. For instance, for the self-assembly phase, we deliberately search for the names of phases, including "fiber", "sphere" and others. With the selected paragraphs after preprocessing, we employ both the pretrained GPT model and the model fine-tuned with our manual dataset to perform "Named Entity Extraction", which extracts 13 target features from the text corpus. (b) Histogram of the length of texts after preprocessing. (c) Performance comparison of the original pretrained GPT model and the fine-tuned model for categorical feature extraction. (d) Performance comparison of the original pretrained GPT model and the fine-tuned model for numerical feature extraction.
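
The preprocessing described in Figure 4(a) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the heading pattern, the rule for what counts as a heading line and the two-word keyword list are all assumptions made for this example, and the caption itself says the real keyword list contains more phase names than "fiber" and "sphere".

```python
import re

# Assumed heading pattern, built from the three examples given in the caption.
SECTION_HEADING = re.compile(
    r"^\s*(materials?|methods?|experimental sections?)\b", re.IGNORECASE
)

# Assumed keyword list: only the two phase names mentioned in the caption.
PHASE_KEYWORDS = ("fiber", "sphere")


def experimental_paragraphs(paragraphs):
    """Return the paragraphs that follow the first experimental-style heading.

    A paragraph counts as a heading only if it is short (at most 6 words),
    so body text that merely starts with "Methods ..." is not mistaken for one.
    No end-of-section detection is attempted in this sketch.
    """
    for i, para in enumerate(paragraphs):
        if len(para.split()) <= 6 and SECTION_HEADING.match(para):
            return paragraphs[i + 1:]
    return []


def relevant_paragraphs(paragraphs, keywords=PHASE_KEYWORDS):
    """Keep paragraphs that mention at least one keyword (case-insensitive)."""
    return [p for p in paragraphs if any(k in p.lower() for k in keywords)]


if __name__ == "__main__":
    doc = [
        "Introduction",
        "Peptides form fibers in many systems.",
        "Materials and Methods",
        "The peptide was dissolved in water at pH 7.",
        "Nanofibers were observed by TEM after 24 h.",
    ]
    for para in relevant_paragraphs(experimental_paragraphs(doc)):
        print(para)
```

The filtered paragraphs would then be passed to the pretrained or fine-tuned model for entity extraction. Substring matching is used deliberately so that "fiber" also catches "nanofibers", at the cost of occasional false positives.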