BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning

Xin Wang, Carlos Oliver

TL;DR

BioBlobs introduces a differentiable graph-partitioning module that learns cohesive, flexibly sized, non-overlapping protein substructures called blobs, which are encoded via a discrete codebook and fused with global representations through a blob-attention mechanism. The approach combines a neural partitioner, a VQ-VAE–style codebook, and a global–blob attention module on top of GVP-based encoders to capture function-relevant modularity in protein structures. Across GO, EC, and SCOP benchmarks, BioBlobs achieves state-of-the-art or competitive results and provides interpretable visualizations in which blobs align with known functional motifs. The framework offers scalable, end-to-end trainable representations that give mechanistic insight into protein function and open avenues for systematic analysis of learned substructure vocabularies.

Abstract

Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as $k$-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
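
To make the codebook and blob-attention ideas concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes nearest-neighbour lookup under squared Euclidean distance (the usual VQ-VAE choice) and replaces the learned attention module with a single scaled dot-product between a global vector and the quantized blob vectors. The function names `nearest_code`, `quantize_blobs`, and `blob_importance`, and all numeric values, are hypothetical.

```python
import math

def nearest_code(z, codebook):
    """Index of the code closest to z (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
    return min(range(len(codebook)), key=lambda i: dists[i])

def quantize_blobs(blob_embeddings, codebook):
    """Replace every blob embedding with its nearest code vector."""
    indices = [nearest_code(z, codebook) for z in blob_embeddings]
    return indices, [codebook[i] for i in indices]

def blob_importance(global_vec, blob_vecs):
    """Softmax over scaled dot-product scores (assumed stand-in for the
    learned global-blob attention); returns one weight per blob."""
    scale = math.sqrt(len(global_vec))
    scores = [sum(g * b for g, b in zip(global_vec, v)) / scale for v in blob_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: three codes and three blob embeddings in two dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
blobs = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indices, quantized = quantize_blobs(blobs, codebook)
weights = blob_importance([1.0, 0.0], quantized)
print(indices)  # code index assigned to each blob
print(weights)  # importance weights, summing to 1
```

In a trainable model the lookup would be paired with a gradient estimator (for example a straight-through estimator) and codebook losses; those are omitted here.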

Paper Structure

This paper contains 29 sections, 16 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the $\textsc{BioBlobs}$ pipeline. The framework consists of four main components: a protein encoder, a neural partitioner, a blob codebook, and a global-blob attention fusion module. The GVP encoder first processes the protein graph and produces residue embeddings. (a) Neural Blob Partitioner. A seed residue is first selected with Gumbel–Softmax. Its $k$-hop neighborhood is then identified to restrict the candidate pool. Finally, a blob expander scores the candidates and assigns residues to form cohesive local substructures. (b) Blob Codebook. The resulting blob embeddings are quantized into a discrete codebook that captures frequent and label-relevant protein substructures. (c) Global–Blob Attention Fusion. The quantized blob embeddings are integrated with the global feature using a multi-key attention mechanism. This produces both a fused representation for classification and an interpretable importance score distribution over blobs.
  • Figure 2: UMAP projection of the blob and code embeddings for the EC(structure) test set, where code embeddings are marked by their indices. Example $\textsc{BioBlobs}$ partitions are shown on two sides. Colored regions represent distinct blobs and their code index. Each protein is annotated with its PDB ID, true and predicted EC numbers, and the importance score $\pi_t$ for each blob.
  • Figure 3: Neural partitioner case study: tuning maximum cluster size $S$ and number of clusters $T$.
  • Figure 4: UMAP projection of code and cluster embeddings for the EC dataset, random split.
  • Figure 5: UMAP projection of code and cluster embeddings for the EC dataset, structure split.
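
The partitioning loop described in the Figure 1(a) caption (seed selection, $k$-hop candidate pool, blob expansion) can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's method: seeds are drawn with hard Gumbel-max sampling rather than a differentiable Gumbel–Softmax relaxation, the learned blob expander is replaced by dot-product similarity to the seed embedding, and the $k$-hop search only walks through unassigned residues. The names `gumbel_max_sample`, `k_hop`, and `partition` are hypothetical; `max_size` and `num_blobs` play the roles of $S$ and $T$ from Figure 3.

```python
import math
import random
from collections import deque

def gumbel_max_sample(logits, rng):
    """Draw an index from softmax(logits) using the Gumbel-max trick."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(logits)), key=lambda i: noisy[i])

def k_hop(adj, seed, k, allowed):
    """Nodes within k hops of seed, walking only through `allowed` nodes."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj[node]:
            if nb in allowed and nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    seen.discard(seed)
    return seen

def partition(adj, seed_logits, embed, k, max_size, num_blobs, rng):
    """Greedily build up to num_blobs non-overlapping blobs of size <= max_size."""
    unassigned = set(range(len(adj)))
    blobs = []
    for _ in range(num_blobs):
        if not unassigned:
            break
        pool = sorted(unassigned)
        seed = pool[gumbel_max_sample([seed_logits[i] for i in pool], rng)]
        candidates = k_hop(adj, seed, k, unassigned)

        def score(i):
            # Assumed stand-in for the learned blob expander.
            return sum(a * b for a, b in zip(embed[i], embed[seed]))

        ranked = sorted(candidates, key=lambda i: (-score(i), i))
        blob = [seed] + ranked[:max_size - 1]
        blobs.append(sorted(blob))
        unassigned -= set(blob)
    return blobs, sorted(unassigned)

# Toy protein graph: a chain of six residues with 2-D embeddings.
adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
embed = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 1.0], [0.0, 1.0]]
seed_logits = [0.0] * 6

blobs, leftover = partition(adj, seed_logits, embed, k=2, max_size=3,
                            num_blobs=2, rng=random.Random(0))
print(blobs, leftover)
```

Removing assigned residues from the pool after each step is what keeps the blobs non-overlapping; residues left over after `num_blobs` rounds are returned separately, and how such residues are handled in the actual model is not specified here.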