Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering
Neeru Dubey, Elin Karlsson, Miguel Angel Redondo, Johan Reimegård, Anna Rising, Hedvig Kjellström
TL;DR
This work tackles the challenge of linking MaSp repeat sequences to fiber mechanics by introducing a three-stage GPT-based framework. It distills ProtGPT2 into a lightweight model, SpiderGPT, fine-tunes it on MaSp repeats, and then enables bidirectional sequence–property conditioning using a small labeled dataset and parameter-efficient LoRA adaptation. The approach yields biologically plausible, property-conditioned MaSp repeats, achieves strong property-prediction accuracy (e.g., Pearson $r=0.8884$, cosine similarity $=0.9827$), and produces sequences that are novel relative to broader databases, as assessed with BLAST. Ablation studies show that both distillation and level-1 fine-tuning are necessary for faithful generation and reliable property estimation. Collectively, the method advances the rational design of spider-silk–inspired biomaterials with tunable mechanical attributes and provides a scalable pipeline for sequence-to-function discovery.
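To make the two agreement metrics quoted above concrete, the following is a minimal sketch of how a Pearson correlation and a cosine similarity between measured and predicted property values can be computed. The function names and the example numbers are illustrative assumptions; this is not the authors' evaluation code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of the centered values over the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson r is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(xs, ys):
    """Cosine of the angle between the two (uncentered) vectors."""
    if len(xs) != len(ys):
        raise ValueError("need two equal-length sequences")
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical measured vs. predicted values for one mechanical property.
    measured = [1.1, 0.9, 1.4, 0.7]
    predicted = [1.0, 1.0, 1.3, 0.8]
    print("Pearson r:", pearson_r(measured, predicted))
    print("Cosine similarity:", cosine_similarity(measured, predicted))
```

Pearson correlation centers both vectors before comparing them, whereas cosine similarity compares the raw vectors. For strictly positive quantities of similar scale, cosine similarity therefore tends to be the higher of the two, which is worth keeping in mind when reading the two figures side by side.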
Abstract
The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging because of the intricate sequence-structure-function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we develop a lightweight GPT-based generative model by distilling the pre-trained ProtGPT2 protein language model. The distilled model then undergoes multilevel fine-tuning on curated subsets of the Spider Silkome dataset. Specifically, we first adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and then refine it with 572 repeats associated with experimentally determined fiber-level mechanical properties. The resulting model generates biologically plausible MaSp repeat regions tailored to specified mechanical properties and also predicts those properties for given sequences. Validation includes sequence-level analysis of physicochemical attributes, the expected distribution of key motifs, and secondary structure composition. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirms the predictive accuracy of the model. This framework advances the rational design of spider silk-inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.
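As an illustration of the motif-distribution check mentioned in the validation step, the sketch below counts occurrences of a few motif classes in a repeat sequence. The abstract does not list which motifs were analyzed. The classes used here (poly-alanine runs, GGX, GPGXX) are ones commonly described for MaSp repeats, and the regular expressions, the threshold of four or more alanines, the function name, and the toy sequence are all assumptions made for this example.

```python
import re

# Assumed motif definitions (not taken from the paper); X stands for any residue.
MOTIF_PATTERNS = {
    "polyA": re.compile(r"A{4,}"),   # run of at least four alanines
    "GGX": re.compile(r"GG."),       # glycine-glycine followed by any residue
    "GPGXX": re.compile(r"GPG.."),   # GPG followed by any two residues
}


def count_motifs(sequence):
    """Return the number of non-overlapping matches of each motif pattern."""
    seq = sequence.upper()
    return {
        name: sum(1 for _ in pattern.finditer(seq))
        for name, pattern in MOTIF_PATTERNS.items()
    }


if __name__ == "__main__":
    # Toy sequence for demonstration only; not a real spidroin repeat.
    toy_repeat = "GGYGPGQQAAAAAA"
    print(count_motifs(toy_repeat))
```

Comparing such counts, normalized by sequence length, between generated and natural repeats is one simple way to check whether generated sequences follow the expected motif distribution. Each pattern is counted independently, so a residue may contribute to more than one motif class.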
