Event Embedding of Protein Networks : Compositional Learning of Biological Function

Antonin Sulc

Abstract

In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein-protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64-dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2$\times$ vs 2.9$\times$ above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm--degree anticorrelation are shared with or exceeded by the non-compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.

Paper Structure

This paper contains 27 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Pathway coherence comparison. Event2Vec (left) shows higher coherence and a much lower random baseline than DeepWalk (right) across all 10 pathways.
  • Figure 2: Protein arithmetic comparison. Event2Vec (left column) recovers biologically specific targets with high similarity. DeepWalk (right column) returns less specific predictions with lower similarity.
  • Figure 3: Norm--degree anticorrelation. Both methods show that high-degree proteins (hubs) receive smaller embedding norms, but DeepWalk exhibits stronger anticorrelation ($r=-0.801$) than Event2Vec ($r=-0.627$). This suggests the pattern arises from skip-gram frequency effects rather than compositional structure. Red points highlight canonical hubs (TP53, MYC, AKT1, EGFR, SRC, JUN, BRCA1, TNF). Note the different norm scales: Event2Vec's additive constraint produces smaller, more tightly distributed norms.
  • Figure 4: Hierarchical pathway organization. Cosine similarity between pathway centroids, reordered by Ward clustering. Event2Vec (left) reveals three biologically coherent super-clusters: housekeeping (Ribosome, OxPhos), nuclear/genome (DNA Repair, p53, Cell Cycle), and signaling (PI3K-AKT, MAPK/ERK, NF-$\kappa$B, Apoptosis). Negative correlations (blue) sharply delineate housekeeping pathways from signaling cascades. DeepWalk (right) shows uniformly positive similarities without clear cluster boundaries, indicating weaker functional discrimination.
  • Figure 5: Embedding drift along signaling cascades. Cumulative sums of protein embeddings ($h_t = \sum_{i=1}^{t} e_i$) are computed along three canonical pathways (EGFR signaling, PI3K-AKT-mTOR, p53 DNA damage) and projected via PCA. Event2Vec (top row) produces smooth, directional trajectories where PC1 captures 96--98% of variance, indicating that sequential addition follows a coherent linear path. DeepWalk (bottom row) shows noisier trajectories with greater PC2 spread (PC1 explains only 89--94% of variance), reflecting the absence of an additive constraint. This directly validates Event2Vec's compositional structure: the embedding of a signaling history approximates the sum of its constituent proteins.
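
The computation described in the Figure 5 caption (cumulative sums $h_t = \sum_{i=1}^{t} e_i$ followed by a PCA projection) can be sketched as below. This is a minimal illustration, not the authors' code: the embeddings are synthetic stand-ins, the function names `cumulative_trajectory` and `pca_explained_variance` are hypothetical, and PCA is done with a plain SVD in NumPy rather than whatever library the paper used.

```python
import numpy as np


def cumulative_trajectory(embeddings: np.ndarray) -> np.ndarray:
    """Return h_t = sum of the first t embeddings, one row per step t."""
    return np.cumsum(embeddings, axis=0)


def pca_explained_variance(points: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component."""
    centered = points - points.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# ASSUMPTION: synthetic data standing in for one pathway of 8 proteins
# with 64-dimensional embeddings that share a common direction plus noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
pathway_embeddings = direction + 0.1 * rng.normal(size=(8, 64))

trajectory = cumulative_trajectory(pathway_embeddings)
ratios = pca_explained_variance(trajectory)
print(f"PC1 share of variance: {ratios[0]:.3f}")
```

When the summed embeddings point in a roughly consistent direction, the trajectory is close to a straight line and PC1 absorbs most of the variance. This is the property the caption reports as 96--98% for Event2Vec and 89--94% for DeepWalk.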