Protein Secondary Structure Prediction Using Transformers

Manzi Kevin Maxime

TL;DR

The paper predicts protein secondary structure from amino acid sequences using a transformer whose self-attention captures long-range residue dependencies. Sliding-window augmentation on CB513 expands the training data, and the model generalizes to variable-length sequences, reaching ~88% validation accuracy. The main contributions are showing that a transformer performs well on protein secondary structure prediction (PSSP) with a small, augmented dataset, and outlining potential improvements through pretrained embeddings and interpretability methods. The results support transformer-based approaches for sequence-based protein structure prediction and point toward integrating large-scale protein representations and external benchmarks.

Abstract

Predicting protein secondary structures such as alpha helices, beta sheets, and coils from amino acid sequences is essential for understanding protein function. This work presents a transformer-based model that applies attention mechanisms to protein sequence data to predict structural motifs. A sliding-window data augmentation technique is applied to the CB513 dataset to expand the set of training samples. The transformer shows a strong ability to generalize across variable-length sequences while capturing both local and long-range residue interactions.
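To illustrate the sliding-window augmentation mentioned above, here is a minimal sketch of how one labeled sequence could be split into overlapping training windows. The function name `sliding_windows`, the window size, the stride, and the toy sequence are all assumptions chosen for illustration; the summary does not state the values or the exact procedure used in the paper.

```python
from typing import List, Tuple


def sliding_windows(
    sequence: str,
    labels: str,
    window: int = 5,   # assumed value, not taken from the paper
    stride: int = 2,   # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Split a residue sequence and its per-residue labels into overlapping windows."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Sequences no longer than one window are kept whole.
    if len(sequence) <= window:
        return [(sequence, labels)]
    pairs = []
    # Only full windows are emitted; a trailing remainder shorter than
    # `window` is dropped in this simplified version.
    for start in range(0, len(sequence) - window + 1, stride):
        end = start + window
        pairs.append((sequence[start:end], labels[start:end]))
    return pairs


# Toy example: 9 residues with H (helix), E (strand), C (coil) labels.
pairs = sliding_windows("MKVLAAGIT", "CCHHHHEEC", window=5, stride=2)
for residues, structure in pairs:
    print(residues, structure)
```

With these settings the single 9-residue sequence yields three overlapping (residues, labels) pairs, which is how a small dataset can be expanded into more training samples. Overlapping windows from the same protein are highly correlated, so in practice the train/validation split would likely need to be made at the protein level before windowing.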

Paper Structure

This paper contains 10 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Distribution of sequence lengths for grouped proteins in the dataset.
  • Figure 2: Distribution of amino acid residues in the augmented protein sequence dataset.
  • Figure 3: Distribution of secondary structure elements (H, E, C), showing H and C as most common.
  • Figure 4: Overview of the transformer-based model architecture.
  • Figure 5: Training and validation accuracy curves.
  • ...and 2 more figures