BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
Xin Wang, Carlos Oliver
TL;DR
BioBlobs introduces a differentiable graph-partitioning module that learns flexibly sized, non-overlapping protein substructures called blobs, which are encoded via a discrete codebook and fused with global representations through a blob-attention mechanism. The approach combines a neural partitioner, a VQ-VAE–style codebook, and a global–blob attention module on top of GVP-based encoders to capture function-relevant modularity in protein structures. Across GO, EC, and SCOP benchmarks, BioBlobs achieves state-of-the-art or competitive results and provides interpretable visualizations in which blobs align with known functional motifs. The framework offers scalable, end-to-end trainable representations that give mechanistic insight into protein function and open avenues for systematic analysis of learned substructure vocabularies.
Abstract
Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as k-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
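
To make the two central ideas concrete (non-overlapping blobs and a shared discrete codebook), the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hard residue-to-blob assignment is already given, mean-pools residue features into one embedding per blob, and maps each blob to its nearest codebook entry. The function names `pool_blobs` and `quantize`, the toy features, and the three-entry codebook are all hypothetical; the learned, differentiable partitioner and the training of the codebook are deliberately left out.

```python
import numpy as np


def pool_blobs(node_feats, assignment, num_blobs):
    """Mean-pool residue features into blobs.

    `assignment[i]` is the single blob index of residue i, so blobs
    are non-overlapping by construction (an assumption of this sketch).
    """
    dim = node_feats.shape[1]
    sums = np.zeros((num_blobs, dim))
    np.add.at(sums, assignment, node_feats)  # scatter-add per blob
    counts = np.bincount(assignment, minlength=num_blobs).reshape(-1, 1)
    return sums / np.maximum(counts, 1)  # avoid division by zero for empty blobs


def quantize(blob_embs, codebook):
    """Map each blob embedding to its nearest codebook entry (squared L2)."""
    diffs = blob_embs[:, None, :] - codebook[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    codes = dist2.argmin(axis=1)
    return codes, codebook[codes]


# Toy protein: 5 residues with 2-d features, split into 2 blobs.
node_feats = np.array(
    [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8], [0.9, 1.2]]
)
assignment = np.array([0, 0, 1, 1, 1])

# Hypothetical shared codebook with 3 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

blob_embs = pool_blobs(node_feats, assignment, num_blobs=2)
codes, quantized = quantize(blob_embs, codebook)
print(codes)  # discrete substructure tokens, one per blob
```

The integer `codes` play the role of the discrete substructure vocabulary described above. The `argmin` lookup is not differentiable on its own; VQ-VAE-style models typically pass gradients through it with a straight-through estimator, and whether BioBlobs does exactly this is not stated in this section.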
