Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering

Neeru Dubey, Elin Karlsson, Miguel Angel Redondo, Johan Reimegård, Anna Rising, Hedvig Kjellström

TL;DR

This work tackles the challenge of linking MaSp repeat sequences to fiber mechanics by introducing a three-stage GPT-based framework. It distills ProtGPT2 into a lightweight model, SpiderGPT, fine-tunes that model on MaSp repeats, and then enables bidirectional sequence–property conditioning using a small labeled dataset with parameter-efficient LoRA fine-tuning. The approach yields biologically plausible, property-conditioned MaSp repeats and demonstrates strong property-prediction accuracy (e.g., Pearson $r=0.8884$, cosine similarity $=0.9827$) as well as novelty against broader databases, assessed with BLAST. Ablation studies show that both distillation and level-1 fine-tuning are necessary for faithful generation and reliable property estimation. Collectively, the method advances the rational design of spider-silk–inspired biomaterials with tunable mechanical attributes and provides a scalable pipeline for sequence-to-function discovery.
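The two accuracy figures quoted above are standard agreement measures between predicted and measured property values. The sketch below shows how such scores are typically computed. It is not the authors' evaluation code; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(xs, ys):
    """Cosine of the angle between two equal-length vectors."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have equal length")
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical measured vs. predicted values for one mechanical property.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(pearson_r(measured, predicted))
print(cosine_similarity(measured, predicted))
```

Pearson $r$ is insensitive to a constant offset or scale in the predictions, whereas cosine similarity compares the direction of the raw vectors, so the two scores can differ noticeably on the same data.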

Abstract

The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging due to the intricate sequence-structure-function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we developed a lightweight GPT-based generative model by distilling the pre-trained ProtGPT2 protein language model. The distilled model was subjected to multilevel fine-tuning using curated subsets of the Spider Silkome dataset. Specifically, we adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and further refine it with 572 repeats associated with experimentally determined fiber-level mechanical properties. Our model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties while also predicting those properties for given sequences. Validation includes sequence-level analysis, assessing physicochemical attributes and expected distribution of key motifs as well as secondary structure compositions. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirmed the predictive accuracy of the model. This framework advances the rational design of spider silk-inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.
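The motif-distribution check mentioned in the abstract can be illustrated with a short counting routine. This is a minimal sketch, not the authors' pipeline: it assumes the motif classes commonly reported for MaSp repeats (poly-alanine runs, GGX, and GPGXX), and the minimum poly-alanine length of four residues is an arbitrary choice made here for illustration.

```python
import re

# Assumed motif definitions; X stands for any residue letter.
MOTIF_PATTERNS = {
    "poly_A": re.compile(r"A{4,}"),       # runs of 4 or more alanines (assumed cutoff)
    "GGX": re.compile(r"GG[A-Z]"),        # glycine-glycine-X
    "GPGXX": re.compile(r"GPG[A-Z]{2}"),  # glycine-proline-glycine-X-X
}


def count_motifs(sequence):
    """Count non-overlapping matches of each motif in one repeat sequence."""
    seq = sequence.upper()
    return {name: len(pattern.findall(seq)) for name, pattern in MOTIF_PATTERNS.items()}


def mean_motif_counts(sequences):
    """Average motif counts over a collection of sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    totals = {name: 0 for name in MOTIF_PATTERNS}
    for seq in sequences:
        for name, n in count_motifs(seq).items():
            totals[name] += n
    return {name: total / len(sequences) for name, total in totals.items()}


# Hypothetical toy repeat, not taken from the Spider Silkome dataset.
print(count_motifs("GGAGQGGYGPGSGAAAAAAGG"))
```

Running `mean_motif_counts` on a set of natural repeats and on a set of generated repeats gives two profiles that can be compared directly. Each pattern is counted independently, so a stretch such as `GGYGPGSG` contributes to both the GGX and the GPGXX counts.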

Paper Structure

This paper contains 30 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Hierarchical representation of the spider dragline silk fiber architecture, highlighting the schematic image of MaSp showing various sequential elements.
  • Figure 2: Illustration of the proposed methodology, organized into three stages. Stage 1 involves training a distilled ProtGPT2 model using spider protein sequences from UniProtKB (doi:10.1093/nar/gkae1010). In Stage 2, the model is fine-tuned on the repeat regions of MaSp to adapt to their unique patterns. Finally, in Stage 3, the model is further fine-tuned to capture correlations between MaSp repeats and their mechanical properties.
  • Figure 3: This figure presents a comprehensive comparison of nine key physicochemical and structural properties between naturally occurring (Natural) and computationally generated (Generated) proteins. The analysis includes distributions of KL divergence, Hamming distance, molecular weight, isoelectric point, instability index, sequence length, motif patterns, secondary structure elements, and amino acid composition. The plots demonstrate the degree of similarity between generated proteins and their natural counterparts across multiple biologically relevant parameters, providing insights into the fidelity of the protein design process.
  • Figure 4: The heatmap illustrates the correlation between sequential features and the mechanical properties of the generated sequences. While some weak correlations are present, the overall low values suggest the need for a more advanced approach to better capture sequence-property relationships.
  • Figure 5: Comparison between original (natural) and generated sequences on the test set in terms of various metrics. (1) Sequence properties: sequence length, molecular weight, instability index, and isoelectric point. (2) Average amino acid frequency distribution grouped by physicochemical properties. The consistency of these properties supports the validity of the generated sequences in terms of their structural and biochemical features, and the close alignment demonstrates the model’s ability to capture the key characteristics and properties of MaSp.
  • ...and 4 more figures
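
Two of the quantities listed in the Figure 3 caption, KL divergence and Hamming distance, can be sketched as follows. The captions do not say how either was computed, so this is only one plausible reading: KL divergence is taken between amino acid composition distributions with add-one smoothing, and Hamming distance is taken between equal-length sequences. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_distribution(sequences, pseudocount=1.0):
    """Smoothed amino acid frequency distribution over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_divergence(p, q):
    """KL(p || q) in nats; both inputs must be strictly positive distributions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy sequences standing in for natural and generated repeats.
natural = ["GGAGQGGYGPGSGAAAAAA", "GPGQQGPGGYAAAAAA"]
generated = ["GGAGQGGYGPGAGAAAAAA", "GPGQQGPGSYAAAAAA"]
print(kl_divergence(aa_distribution(natural), aa_distribution(generated)))
print(hamming_distance(natural[0], generated[0]))
```

The pseudocount keeps every probability nonzero so the logarithm is always defined. For sequences of unequal length, an alignment-based distance would be needed in place of the Hamming distance.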