A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification

Sebastian Ojeda, Rafael Velasquez, Nicolás Aparicio, Juanita Puentes, Paula Cárdenas, Nicolás Andrade, Gabriel González, Sergio Rincón, Carolina Muñoz-Camargo, Pablo Arbeláez

TL;DR

The paper introduces ESCAPE, the first standardized multilabel benchmark for antimicrobial peptide (AMP) classification, unifying data from 27 public repositories into a coherent five-class framework (antibacterial, antifungal, antiviral, antiparasitic, antimicrobial) with a robust non-AMP negative set. It presents ESCAPE Baseline, a dual-branch transformer that fuses sequence and 3D structural information via bidirectional cross-attention to predict multiple activities, and demonstrates clear gains over seven state-of-the-art baselines on mean Average Precision and F1-score. The work further provides a comprehensive data processing pipeline, licensing considerations, and extensive ablations and analyses, including the impact of using predicted 3D structures. Overall, ESCAPE establishes a reproducible, scalable evaluation standard that can accelerate AI-driven AMP discovery, especially for underrepresented functional classes, while highlighting the need for careful data quality and real-world validation.

Abstract

Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.
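
To make the headline metric concrete, the following is a minimal sketch of how mean Average Precision and a relative improvement could be computed for a multilabel task. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the tie handling, and the toy labels and scores are all hypothetical, and the exact mAP variant used in ESCAPE is not specified in this summary.

```python
def average_precision(labels, scores):
    """AP for one class: mean of precision@k over the ranks k that hold a positive.

    Assumption: ties in score keep their input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 here (an assumption of this sketch).
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(y_true, y_score):
    """Unweighted mean of the per-class APs; rows are peptides, columns are classes."""
    n_classes = len(y_true[0])
    per_class = [
        average_precision([row[c] for row in y_true], [row[c] for row in y_score])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy example with two classes and four peptides (values are made up).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.9], [0.1, 0.3]]
toy_map = mean_average_precision(y_true, y_score)
```

A relative improvement is measured against the baseline score, so it is not the same as the absolute difference in mAP points between two methods.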

Paper Structure

This paper contains 29 sections, 1 equation, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Timeline of AMP Discovery and Computational Advances. The rise of AMR underscores the urgent need for alternative therapies such as AMPs. While AI has shown promise in accelerating AMP discovery, progress is hindered by heterogeneous data and the absence of standardized evaluation protocols. We introduce ESCAPE to address these challenges and provide a robust foundation for future AI-driven methods.
  • Figure 2: Overview of ESCAPE Dataset Composition and Statistics. (a) Multilabel distribution of AMPs across the four functional classes in the ESCAPE Dataset, (b) sequence length distribution for AMPs and non-AMPs, and (c) distribution of AMP and non-AMP sequences in the two folds and the test set of the dataset.
  • Figure 3: ESCAPE Baseline Architecture Overview. The model encodes each peptide using two parallel branches. The sequence module tokenizes amino acid residues and extracts a [CLS] representation through a Transformer encoder, while the structure module processes a $224 \times 224$ distance matrix by embedding non-overlapping patches and applying a Transformer stack to produce a structural [CLS] token. A bidirectional cross-attention mechanism fuses these two representations by allowing each modality to attend to the other. The model concatenates the resulting attended [CLS] vectors and passes them through a linear layer to generate the final multilabel prediction vector.
  • Figure 4: Comparison of amino acid distributions in the ESCAPE dataset. (a) Amino acid distributions for AMPs and non-AMPs, with frequency differences reflecting variations between functional and non-functional peptides. (b) Normalized amino acid distributions with respect to each class for the multilabel classification task. Overall, the dataset maintains a consistent amino acid composition across categories.
  • Figure 5: Comparison of model performance and number of trainable parameters across all evaluated methods. Lighter models such as the ESCAPE Baseline and AMPlify [li2022amplify] show the best ensemble results on the test split, while heavier models (e.g., BERT-based transformers [pang2022integrating, lee2023amp]) yield lower performance, so we observe no consistent correlation between model size and predictive capability. Specifically, the ESCAPE Baseline achieves the best overall results with a fraction of the parameters used by large transformer models, suggesting that performance gains can be attained without increased model complexity.
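
The fusion step described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a simplified illustration, not the ESCAPE Baseline implementation: it assumes identity query/key/value projections, a single attention head, each branch's [CLS] vector (taken to be the first token) attending over all tokens of the other branch, a sigmoid on the output logits, and five output labels as listed in the TL;DR. All names (`attend`, `fuse_and_predict`) and the random toy inputs are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Scaled dot-product attention of one query vector over a list of token vectors.

    Assumption: identity projections, so keys and values are the tokens themselves.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def fuse_and_predict(seq_tokens, struct_tokens, weight, bias):
    """Bidirectional cross-attention on the two [CLS] vectors, then a linear layer."""
    seq_cls, struct_cls = seq_tokens[0], struct_tokens[0]
    seq_attended = attend(seq_cls, struct_tokens)      # sequence -> structure
    struct_attended = attend(struct_cls, seq_tokens)   # structure -> sequence
    fused = seq_attended + struct_attended             # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weight, bias)]
    # Sigmoid per label, so several activities can be predicted at once (assumption).
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy inputs: 8-dimensional tokens, 6 sequence tokens, 4 structure tokens, 5 labels.
rng = random.Random(0)
dim, n_labels = 8, 5
seq_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
struct_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(n_labels)]
bias = [0.0] * n_labels
probs = fuse_and_predict(seq_tokens, struct_tokens, weight, bias)
```

Because each label gets its own independent probability rather than sharing a softmax, a single peptide can score high on several activities, which is what a multilabel prediction vector requires.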