Greater than the Sum of Its Parts: Building Substructure into Protein Encoding Models

Robert Calef, Arthur Liang, Manolis Kellis, Marinka Zitnik

TL;DR

Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function prediction, yields more consistent representations of substructure types never observed during tuning, and shows that substructural supervision provides information that is complementary to global structure inputs.

Abstract

Protein representation learning has advanced rapidly with the scale-up of sequence and structure supervision, but most models still encode proteins either as per-residue token sequences or as single global embeddings. This overlooks a defining property of protein organization: proteins are built from recurrent, evolutionarily conserved substructures that concentrate biochemical activity and mediate core molecular functions. Although substructures such as domains and functional sites are systematically cataloged, they are rarely used as training signals or representation units in protein models. We introduce Magneton, an environment for developing substructure-aware protein models. Magneton provides (1) a dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing protein models, and (3) a benchmark suite of 13 tasks probing representations at the residue, substructural, and protein levels. Using Magneton, we develop substructure-tuning, a supervised fine-tuning method that distills substructural knowledge into pretrained protein models. Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function prediction, yields more consistent representations of substructure types never observed during tuning, and shows that substructural supervision provides information that is complementary to global structure inputs. The Magneton environment, datasets, and substructure-tuned models are all openly available (https://github.com/rcalef/magneton/).

Paper Structure

This paper contains 32 sections, 1 equation, 20 figures, 21 tables.

Figures (20)

  • Figure 1: Overview of protein structure and the Magneton environment. (A) Proteins are built from modular substructures that assemble into full structures. (B) Magneton leverages decades of substructure research to provide an environment for developing and evaluating substructure-aware models.
  • Figure 2: Overview of using Magneton for substructure-tuning. Given a pretrained protein model, substructure-tuning first pools residue-level embeddings into substructure representations, which are then used for supervised fine-tuning via classifier heads specific to each substructure type.
  • Figure 3: (A) Domain classification uses local cues. Even within proteins containing multiple domains, classification accuracy remains high for all contained domains. Labels within bars show the number of test set proteins containing that number of domains. (B) Domain classification accuracy as a function of training set representation. Results shown for ESM-C 300M.
  • Figure A.1: Distribution of substructure lengths by type. Here, length is defined as the total number of residues contained within the substructure, regardless of whether they are contiguous within the sequence.
  • Figure A.2: Distribution of substructure lengths by type. Same as Figure A.1, but filtered to remove outliers (greater than the 99th percentile of length within that type).
  • ...and 15 more figures
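
The pooling step described in the Figure 2 caption can be illustrated with a small sketch. This is not the Magneton implementation: the caption only says that residue-level embeddings are pooled, so the choice of mean pooling, the single linear head, and every name below (`mean_pool`, `linear_head`, the toy embeddings and weights) are assumptions made for illustration. The substructure is given as a list of residue indices, which need not be contiguous, matching the length definition in Figure A.1.

```python
from typing import List, Sequence

Embedding = List[float]


def mean_pool(residue_embeddings: Sequence[Embedding],
              residue_indices: Sequence[int]) -> Embedding:
    """Average the embeddings of the residues in one substructure.

    Mean pooling is an assumption; the source only says "pools".
    """
    if not residue_indices:
        raise ValueError("a substructure must contain at least one residue")
    dim = len(residue_embeddings[0])
    pooled = [0.0] * dim
    for idx in residue_indices:
        for d in range(dim):
            pooled[d] += residue_embeddings[idx][d]
    return [value / len(residue_indices) for value in pooled]


def linear_head(pooled: Embedding,
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Hypothetical classifier head: one logit per substructure class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


# Toy protein: 5 residues with 2-dimensional embeddings (made-up values).
residue_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [5.0, 4.0], [1.0, 1.0]]

# Hypothetical domain covering residues 0, 1 and 3 (not contiguous).
domain_residues = [0, 1, 3]
pooled = mean_pool(residue_embeddings, domain_residues)

# Hypothetical head for one substructure type with three classes.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]
logits = linear_head(pooled, weights, bias)
predicted = max(range(len(logits)), key=logits.__getitem__)

print(pooled)     # [3.0, 2.0]
print(logits)     # [3.0, 2.0, 4.0]
print(predicted)  # 2
```

In a fine-tuning setting, the logits would typically feed a classification loss whose gradients flow back through the pooling step into the pretrained encoder; that training loop is omitted here because the excerpt does not specify the loss or the optimizer.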