Morphology-Aware Peptide Discovery via Masked Conditional Generative Modeling

Nuno Costa, Julija Zavadlav

TL;DR

PepMorph is an end-to-end peptide discovery pipeline that generates novel sequences that are prone to aggregate and whose self-assembly is steered toward fibrillar or spherical morphologies by conditioning on isolated peptide descriptors that serve as morphology proxies.

Abstract

Peptide self-assembly prediction offers a powerful bottom-up strategy for designing biocompatible, low-toxicity materials for large-scale synthesis in a broad range of biomedical and energy applications. However, screening the vast sequence space to categorize aggregate morphology remains intractable. We introduce PepMorph, an end-to-end peptide discovery pipeline that generates novel sequences that are not only prone to aggregate but whose self-assembly is steered toward fibrillar or spherical morphologies by conditioning on isolated peptide descriptors that serve as morphology proxies. To this end, we compiled a new dataset by leveraging existing aggregation propensity datasets and extracting geometric and physicochemical descriptors. This dataset is then used to train a Transformer-based Conditional Variational Autoencoder with a masking mechanism, which generates novel peptides under arbitrary conditioning. After filtering the generated sequences against the design specifications and validating them with coarse-grained molecular dynamics (CG-MD) simulations, PepMorph achieved an 83% success rate for the targeted class under our CG-MD validation protocol and morphology criterion, showcasing its promise as a framework for application-driven peptide discovery.

Paper Structure

This paper contains 4 sections, 15 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: PepMorph dataset. (a) Data curation and feature-extraction workflow. We merge three sources: Wang et al. (2023) ($\sim$62k peptides, 5-10 amino acids (aa)), van Teijlingen & Tuttle (2021) ($\sim$60k, 3-8 aa), and a set of $\sim$39k random peptides (retained set from successful PEP-FOLD runs, 5-10 aa). After deduplication, self-assembly (SA) labels are assigned, and peptide conformations are predicted with PEP-FOLD for the Wang et al. and random peptides to derive biophysical descriptors ($\beta$-strand assignment, net charge, and hydrophobic moment). The resulting PepMorph corpus contains 161k unique peptides spanning 3-10 aa with aggregation-propensity (AP) values and SA/no-SA labels, as well as the calculated peptide-level descriptors. Univariate summaries of the PepMorph dataset are shown for AP density (b), assembly vs. no assembly (c), hydrophobic moment (d), peptide length (e), presence of $\beta$-strand assignment (f), and net charge (g). The no-assembly and assembly regions are highlighted in (b), and the condition regions used when targeting specific morphologies are highlighted in the remaining summaries (d-g).
  • Figure 2: PepMorph model and generation validation. (a) Schematic of the Transformer-based Conditional Variational Autoencoder with the masking mechanism: a descriptor vector $c$ and mask $m$ are summarized into a condition summary that conditions both the latent prior and the autoregressive Transformer decoder, enabling generation under arbitrary subsets of constraints. (b) Amino acid frequency in generated peptides closely follows the training distribution for both common and rare condition sets. (c) Novelty relative to the training set, quantified by the nearest-neighbour normalized edit distance (NED), showing the empirical cumulative fraction of conditioning targets whose mean NED to the closest training peptide is $\leq x$; the curves are similar for common and rare condition sets. (d) Condition-matching as a function of the number of conditioned descriptors $k$: the fraction of peptides meeting their targets declines as constraints tighten. (e) Similarity, measured as Needleman-Wunsch percent identity, of generated sequences (points) to the training set ($\mathrm{Sim}_{\text{train}}$) vs. to other generated sequences within the same common condition set ($\mathrm{Sim}_{\text{gen}}^{\text{within}}$), color coded by the number of conditions $k$; values remain near low-identity baselines.
  • Figure 3: PepMorph pipeline for spherical vs. fibrillar aggregate generation: screening and Molecular Dynamics (MD) visualization. (a) Screening funnel for the two targeted morphologies (left values refer to spheres, right values to fibers). Amino acid occurrence across the funnel for (b) spheres and (c) fibers. For spheres, the validated set collapses to a narrow alphabet dominated by F/I/L/V, whereas fibers remain compositionally closer to the pre-filter pool. (d) Distributions of the RMOI for all MD runs (3 per selected peptide, leading to 90 runs); dashed lines mark success thresholds (spheres $\geq\!0.75$, fibers $\leq\!0.35$), with individual peptide points overlaid. Representative MD snapshots are circled in (d) and shown in (e), with the corresponding sequence and RMOI above each panel illustrating the progression from fibrillar to spherical aggregates.
  • Figure 4: Latent map of morphology conditioning. A UMAP is fit on the conditional prior centers with encoder posteriors of generated peptides projected into the same space (dots for spheres, stars for fibers). (a) Condition center embeddings are colored by conditioned sequence length, revealing an ordered length gradient along the sphere branch and a single, compact, and distinct fiber island; notably, fiber targets of length 7 cluster closer to the sphere branch, consistent with the claim that the fiber conditions are over-constrained. (b) Same embedding with MD-simulated peptides highlighted according to target morphology and outcome (success/failure under the RMOI criterion); stars mark the sphere/fiber prior centers.
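
Two of the quantities in the captions above can be illustrated with a short Python sketch: the nearest-neighbour normalized edit distance (NED) from Figure 2(c) and the RMOI success thresholds from Figure 3(d). This is a minimal sketch under stated assumptions, not the authors' implementation. The captions do not say how NED is normalized, so dividing the Levenshtein distance by the longer sequence length is an assumption. The function names are hypothetical, and RMOI is treated as a precomputed number because its definition is not given here.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (ASSUMED normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def nearest_neighbour_ned(query: str, training: Iterable[str]) -> float:
    """NED from a generated peptide to its closest training peptide."""
    return min(normalized_edit_distance(query, t) for t in training)


# Thresholds quoted in the Figure 3(d) caption.
SPHERE_MIN_RMOI = 0.75
FIBER_MAX_RMOI = 0.35


def meets_morphology_target(rmoi: float, target: str) -> bool:
    """Check a precomputed RMOI value against the caption's success threshold."""
    if target == "sphere":
        return rmoi >= SPHERE_MIN_RMOI
    if target == "fiber":
        return rmoi <= FIBER_MAX_RMOI
    raise ValueError(f"unknown target morphology: {target!r}")
```

Under these assumptions, a generated peptide that differs from its closest training peptide by one substitution out of four residues has an NED of 0.25. An RMOI between 0.35 and 0.75 counts as a failure for either target.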