Evaluating representation learning on the protein structure universe

Arian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Liò, Tom L. Blundell

TL;DR

ProteinWorkshop addresses the need for a standardized, multi-granular benchmark for protein structure representation learning using Geometric GNNs across both experimental and predicted structures. It combines modular featurisation schemes, a suite of denoising-based pretraining tasks, and a diverse set of node- and graph-level downstream tasks, evaluated on large pretraining corpora like AlphaFoldDB. Key findings show that denoising-based pretraining and richer structural detail consistently improve downstream performance, with equivariant GNNs deriving the largest gains; an ESM-2-650M model augmented with structural features can match or exceed specialized GNNs on several tasks. The open-source ProteinWorkshop framework, with storage-efficient loaders and extensible components, aims to standardize comparisons and accelerate advances in protein structure representation learning.

Abstract

We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pretraining and downstream tasks on both experimental and predicted structures to enable systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit more from pretraining than invariant models do. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase lowers the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases, including AlphaFoldDB and ESM Atlas, and (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
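
As a rough illustration of what a denoising-style auxiliary task (as mentioned in the TL;DR) can look like, the sketch below corrupts a toy C-alpha trace with Gaussian noise and scores a noise prediction with mean squared error. This is not the ProteinWorkshop implementation: the function names, the noise scale `sigma=0.1`, and the zero-prediction stand-in for an encoder are all assumptions made for the example.

```python
import random


def add_gaussian_noise(coords, sigma, rng):
    """Return (noisy_coords, noise) for a list of (x, y, z) tuples."""
    noise = [tuple(rng.gauss(0.0, sigma) for _ in range(3)) for _ in coords]
    noisy = [
        tuple(c + n for c, n in zip(xyz, eps))
        for xyz, eps in zip(coords, noise)
    ]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error per coordinate between predicted and true noise."""
    total = 0.0
    count = 0
    for pred, true in zip(predicted_noise, true_noise):
        for p, t in zip(pred, true):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the example is reproducible

# Toy C-alpha trace (hypothetical data): three residues spaced ~3.8 A apart.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

# sigma is an assumed noise scale, chosen only for illustration.
noisy_coords, noise = add_gaussian_noise(ca_coords, sigma=0.1, rng=rng)

# Stand-in for a structure encoder: always predicts zero noise.
zero_prediction = [(0.0, 0.0, 0.0)] * len(ca_coords)

baseline_loss = denoising_loss(zero_prediction, noise)  # untrained baseline
perfect_loss = denoising_loss(noise, noise)             # ideal predictor
```

In an actual pretraining setup, a geometric GNN would take `noisy_coords` (plus node and edge features) as input and be trained to drive this loss below the zero-prediction baseline.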

Paper Structure

This paper contains 53 sections, 13 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Overview of ProteinWorkshop, a comprehensive benchmark suite for evaluating pre-training and representation learning of Geometric GNNs on large-scale protein structure data.
  • Figure 2: Ranking analysis of Gene Ontology-Biological Process (GO-BP) test performance across different encoders, feature sets, and auxiliary tasks.
  • Figure 3: Ranking analysis of Gene Ontology-Molecular Function (GO-MF) test performance across different encoders, feature sets, and auxiliary tasks.
  • Figure 4: Ranking analysis of Gene Ontology-Cellular Component (GO-CC) test performance across different encoders, feature sets, and auxiliary tasks.
  • Figure 5: Ranking analysis of Antibody Developability test performance across different encoders, feature sets, and auxiliary tasks.
  • ...and 5 more figures