Simplifying complex machine learning by linearly separable network embedding spaces

Alexandros Xenos; Noel Malod-Dognin; Natasa Przulj

TL;DR

The paper shows that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, which yields better downstream analysis results. It also introduces novel graphlet-based methods that embed networks into more linearly separable spaces, allowing them to be mined more effectively.

Abstract

Low-dimensional embeddings are a cornerstone in the modelling and analysis of complex networks. However, most existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream tasks. In the field of NLP, word embedding spaces capture semantic relationships \textit{linearly}, allowing for information retrieval using \textit{simple linear operations} on word embedding vectors. Here, we demonstrate that there are structural properties of network data that yield this linearity. We show that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, yielding better downstream analysis results. Hence, we introduce novel graphlet-based methods that embed networks into more linearly separable spaces, allowing for their better mining. Our fundamental insights into the structure of network data that enables its \textit{\textbf{linear}} mining and exploitation give the ML community a foundation to build upon, towards efficient and explainable mining of complex network data.
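To make the notion of homophily concrete, the following minimal sketch computes an edge homophily index under the common definition: the fraction of edges whose two endpoints share a label. This definition is an assumption here; the paper's own homophily measures are given in its Methods section and may differ, for example for multi-labeled networks. The function name `edge_homophily` and the toy graph are hypothetical and serve only as illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label.

    edges  : list of (u, v) node pairs, treated as undirected
    labels : dict mapping each node to a single label
    """
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

h = edge_homophily(edges, labels)
print(f"edge homophily index: {h:.3f}")
```

In this toy graph, six of the seven edges connect same-label nodes, so the index is close to 1. A value near 1 indicates a strongly homophilic representation, the regime the abstract associates with more linearly separable embedding spaces.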

Paper Structure (27 sections, 15 equations, 8 figures, 14 tables)

Introduction
Contribution
Methods
Datasets
Biological multi-labeled networks:
Gene annotations:
Single-labeled networks:
Network embeddings
Graphlets and Graphlet Adjacency
Novel graphlet-based network matrix representations
Homophily measures
Non-negative matrix tri-factorization based embeddings
Linear separability of the embedding space
Random partition graph model
Downstream analysis tasks
...and 12 more sections

Figures (8)

Figure 1: Graphlet-based network matrix representations lead to more homophilic representations. In Panel A, for each graphlet (x-axis) and method (color-coded), the line plot shows the node homophily index averaged over the six biological multi-labeled networks, along with the standard deviation. Panel B shows the same, averaged over the seven single-labeled networks. In Panel C, for each graphlet (x-axis) and method (color-coded), the line plot shows the Geometric Separability Index (y-axis) averaged over the six biological multi-labeled networks, along with the standard deviation. Panel D shows the same, averaged over the seven single-labeled networks.
Figure 1: The nine 2- to 4-node graphlets and their 15 orbits. Within each graphlet, $G_i, i \in \{0,\ldots,8\}$, nodes belonging to the same orbit are of the same shade and are numbered from 0 to 14. The eleven non-redundant orbits, whose counts cannot be derived from the counts of the other orbits, are highlighted in red (Yaveroglu2014).
Figure 2: Graphlet-based embeddings lead to better results in downstream analysis tasks. Panel A presents the results of the functional module discovery in gene embedding spaces of the biological multi-labeled networks. In particular, for each graphlet (x-axis) and for each method (color-coded) the line plot shows the average percentage, over the six molecular networks, of annotated genes in the clusters that have at least one Reactome Pathway (RP) term enriched in their clusters, along with the standard deviation (y-axis). Panel B presents the results of the label prediction based on the cosine similarity in the embedding space of the single-labeled networks. For each graphlet (x-axis) and for each method (color-coded), the line plot shows the average weighted AUROC score (y-axis) over the seven single-labeled networks and the standard deviation.
Figure 2: Graphlet-based network representations are more homophilic. In the left panel, for each graphlet (x-axis) and for each method (color-coded), the line plot shows the edge homophily index averaged over the six biological multi-labeled networks, along with the standard deviation. The right panel shows the same, but for the seven single-labeled networks.
Figure 3: DeepGraphlets node classification performance in the single-labeled networks. In the left panel, for each of the nine graphlets (x-axis) and for each classifier (color-coded), the line plot shows, for the corresponding nine DeepGraphlets-based embeddings, the weighted node classification F1-score averaged over the three fully linear single-labeled networks (Cora, CS Co-author and Wikipedia CS), along with the standard deviation. The right panel shows the same, but for the four non-linearly separable single-labeled networks (Chameleon, Squirrel, CiteSeer and USA air-traffic).
...and 3 more figures
