Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions

Rui Sun, Lirong Wu, Haitao Lin, Yufei Huang, Stan Z. Li

TL;DR

This work addresses the scarcity of labeled protein data by adapting data augmentation techniques from images and text to proteins and by introducing Automated Protein Augmentation (APA). It proposes two semantic-level augmentations, Integrated Gradients Substitution and Back Translation Substitution, and builds an augmentation pool from which APA adaptively selects augmentations for each task and backbone. Across five protein-related tasks and three architectures, APA improves on vanilla training by an average of 10.55%, with semantic-level methods often outperforming token- and sequence-level approaches. The results highlight the value of semantic-aware augmentation and its potential to complement protein pre-training, and point to future work on protein structure augmentation and scaling to larger models.

Abstract

Augmentation is an effective alternative for making use of the small amount of labeled protein data. However, most existing work focuses on designing new architectures or pre-training tasks, and relatively little work has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution, which enable semantic-aware protein augmentation through saliency detection and biological knowledge. Finally, we integrate the extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA), which adaptively selects the most suitable augmentation combinations for different tasks. Extensive experiments show that APA improves performance on five protein-related tasks by an average of 10.55% across three architectures compared to vanilla implementations without augmentation, highlighting its potential to make a great impact on the field.

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Illustrations of eight protein augmentations, where unmodified amino acids are marked in blue and modified amino acids in orange/green.
  • Figure 2: Illustration of the two novel semantic-level augmentation methods. (a) Integrated Gradients Substitution: amino acids within the dashed rectangle are the salient regions identified by integrated gradients, with substitutions marked in orange. (b) Back Translation Substitution: rectangles denote amino acids and circles denote nucleotides. Because each amino acid can correspond to multiple codons, the method samples a codon for each amino acid to reverse-translate the protein, introduces random nucleotide substitutions, and then translates the augmented nucleotide sequence back into a protein sequence.
  • Figure 3: A diagram of the Automated Protein Augmentation framework. In stage 1, we train the initial model with a uniform sampling policy to obtain the weight-shared model $\mathcal{F}(\cdot|\bar{\theta})$. In stage 2, we fine-tune $\mathcal{F}(\cdot|\bar{\theta})$ on the training set separately under each of $N$ different augmentation policies. Finally, we select the best-performing policy and its fine-tuned model according to validation accuracy. An illustration of the augmentation policy is shown on the right. Each policy consists of $M$ sub-policies, and each sub-policy applies two augmentation transformations in sequence, parameterized by a calling probability $p$ and an augmentation magnitude $\lambda$. The right part shows the process of applying a sub-policy to a protein sequence, where the gray vacant rectangle indicates that the transformation is not applied, i.e., $p=0$.
  • Figure 4: Heatmap visualization of Integrated Gradients attributions for 4 proteins in the same batch at epochs 0, 10, 20, and 30, using the ESM-2-35M model on the subcellular localization task. Lighter shades represent regions with higher contributions to the final predictions.
  • Figure 5: Comparison of training loss and test accuracy between the vanilla LSTM model and the model with APA over 50 epochs on the Subloc task. The curves demonstrate the advantages of the APA-enhanced model in terms of convergence speed and test accuracy.
  • ...and 1 more figures
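
The Back Translation Substitution procedure described in the Figure 2(b) caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `back_translation_substitution`, the `rate` parameter (per-nucleotide substitution probability), the uniform codon sampling, and the handling of stop codons and non-standard residues are all assumptions made for this example. Only the standard genetic code table is taken as given.

```python
import random

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Inverse map: amino acid -> list of codons that encode it (stop codons excluded).
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    if aa != "*":
        AA_TO_CODONS.setdefault(aa, []).append(codon)


def back_translation_substitution(protein, rate=0.05, rng=None):
    """Hypothetical sketch: protein -> sampled codons -> mutated codons -> protein."""
    rng = rng or random.Random()
    augmented = []
    for aa in protein:
        if aa not in AA_TO_CODONS:
            # Assumption: non-standard residues (e.g. 'X') are left unchanged.
            augmented.append(aa)
            continue
        # Reverse translation: sample one of the codons encoding this residue.
        codon = rng.choice(AA_TO_CODONS[aa])
        # Random nucleotide substitution: each base mutates with probability `rate`.
        mutated = "".join(
            rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
            for base in codon
        )
        new_aa = CODON_TO_AA[mutated]
        # Assumption: a mutation that yields a stop codon is discarded.
        augmented.append(aa if new_aa == "*" else new_aa)
    return "".join(augmented)


if __name__ == "__main__":
    print(back_translation_substitution("MKTAYIAKQRQISFVK", rate=0.1, rng=random.Random(0)))
```

Because the mutation happens at the nucleotide level, many substitutions are silent or lead to residues whose codons differ by a single base, which is one way the genetic code's structure can enter the augmentation.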
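
The sub-policy mechanism described in the Figure 3 caption (two transformations applied in sequence, governed by a calling probability $p$ and a magnitude $\lambda$) can be illustrated as follows. This is a hedged sketch under several assumptions: the two transformations `random_substitution` and `random_deletion` are illustrative stand-ins rather than the paper's augmentation pool, each transformation is given its own `p` and `magnitude`, and one sub-policy is drawn uniformly at random per sequence. The names `Op`, `apply_sub_policy`, and `apply_policy` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

Transform = Callable[[str, float, random.Random], str]


def random_substitution(seq: str, magnitude: float, rng: random.Random) -> str:
    """Stand-in transform: resample each residue with probability `magnitude`."""
    return "".join(rng.choice(RESIDUES) if rng.random() < magnitude else aa for aa in seq)


def random_deletion(seq: str, magnitude: float, rng: random.Random) -> str:
    """Stand-in transform: drop each residue with probability `magnitude`."""
    kept = "".join(aa for aa in seq if rng.random() >= magnitude)
    return kept or seq  # never return an empty sequence


@dataclass
class Op:
    transform: Transform
    p: float          # calling probability
    magnitude: float  # augmentation magnitude (lambda)


SubPolicy = Tuple[Op, Op]


def apply_sub_policy(seq: str, sub_policy: SubPolicy, rng: random.Random) -> str:
    """Apply the two transformations in order; each fires with its own probability p."""
    for op in sub_policy:
        if rng.random() < op.p:  # p = 0 means the transformation is never applied
            seq = op.transform(seq, op.magnitude, rng)
    return seq


def apply_policy(seq: str, policy: List[SubPolicy], rng: random.Random) -> str:
    """Assumption: one of the M sub-policies is chosen uniformly for each sequence."""
    return apply_sub_policy(seq, rng.choice(policy), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    policy = [
        (Op(random_substitution, 0.8, 0.1), Op(random_deletion, 0.3, 0.05)),
        (Op(random_deletion, 0.5, 0.1), Op(random_substitution, 0.5, 0.2)),
    ]
    print(apply_policy("MKTAYIAKQRQISFVK", policy, rng))
```

Under this reading, stage 2 of the framework would fine-tune the shared model once per candidate policy (a list like `policy` above) and keep the one with the best validation accuracy.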