An embedding-based distance for temporal graphs

Lorenzo Dall'Amico; Alain Barrat; Ciro Cattuto

TL;DR

The authors introduce a notion of distance to compare time-resolved interaction patterns, based on their representations as temporal graphs, which is well-defined for pairs of temporal graphs with different numbers of nodes and different time spans.

Abstract

Temporal graphs are commonly used to represent time-resolved relations between entities in many natural and artificial systems. Many techniques were devised to investigate the evolution of temporal graphs by comparing their state at different time points. However, quantifying the similarity between temporal graphs as a whole is an open problem. Here, we use embeddings based on time-respecting random walks to introduce a new notion of distance between temporal graphs. This distance is well-defined for pairs of temporal graphs with different numbers of nodes and different time spans. We study the case of a matched pair of graphs, when a known relation exists between their nodes, and the case of unmatched graphs, when such a relation is unavailable and the graphs may be of different sizes. We use empirical and synthetic temporal network data to show that the distance we introduce discriminates graphs with different topological and temporal properties. We provide an efficient implementation of the distance computation suitable for large-scale temporal graphs.
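
As a rough illustration of what a time-respecting random walk means, the sketch below builds a node-to-node transition matrix by chaining one transition step per snapshot, in increasing time order. It is a minimal sketch under stated assumptions, not the paper's definition of $P$: the rule that the walker moves to a uniformly chosen active neighbor (and stays put when it has none), the chaining over all snapshots from the first to the last, and the function names are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def snapshot_transition(n, edges):
    """Row-stochastic matrix for one snapshot (assumed rule, for illustration):
    move to a uniformly chosen active neighbor, or stay put if isolated."""
    nbrs = defaultdict(set)
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if nbrs[i]:
            for j in nbrs[i]:
                M[i][j] = 1.0 / len(nbrs[i])
        else:
            M[i][i] = 1.0
    return M


def matmul(A, B):
    """Plain-Python product of two square matrices of equal size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def time_respecting_walk_matrix(n, temporal_edges):
    """Chain the snapshot transitions in increasing time order, so entry (i, j)
    is the probability of ending at j when starting at i before the first snapshot."""
    by_time = defaultdict(list)
    for i, j, t in temporal_edges:
        by_time[t].append((i, j))
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for t in sorted(by_time):
        P = matmul(P, snapshot_transition(n, by_time[t]))
    return P


# Temporal edges (i, j, t): 0-1 at t=0, 1-2 at t=1, 0-1 at t=2.
edges = [(0, 1, 0), (1, 2, 1), (0, 1, 2)]
P = time_respecting_walk_matrix(3, edges)
```

In this toy example a walker starting at node 0 reaches node 2 only because the edge 0-1 is active before the edge 1-2; the order of the snapshots, not just the set of edges, determines the result.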

Paper Structure (23 sections, 13 equations, 4 figures, 1 table)

This paper contains 23 sections, 13 equations, 4 figures, 1 table.

Introduction
Results
Discussion
Methods
Data availability
Code availability
Acknowledgments
Authors contributions
Competing interests

Figures (4)

Figure 1: Distance computation between two temporal graphs. The inputs to the distance function are two temporal graphs $\mathcal{G}_1$ (top, orange nodes) and $\mathcal{G}_2$ (bottom, green nodes), generally with different numbers of snapshots ($T_1 = 5$, $T_2 = 4$ in the example) and different numbers of nodes $n$ (here $n_1 = 6$, $n_2 = 5$). Each graph is represented as a matrix $P \in\mathbb{R}^{n\times n}$, defined in Eq. \ref{eq:Pdyn}, whose entry $(ij)$ encodes the probability that a random walker goes from $i$ to $j$ following a time-respecting path (the walker's position at each time point is indicated in blue). The matrices $P_1, P_2$ are then embedded using the EDRep algorithm, which maps them to $X_1 \in \mathbb{R}^{n_1\times d}$ and $X_2 \in \mathbb{R}^{n_2\times d}$. Finally, the matched (Eq. \ref{eq:dm}) and unmatched (Eq. \ref{eq:du}) distances are calculated. Note that $d_{\rm m}$ can be computed only if $n_1 = n_2$.
Figure 2: Validation of the distances on graphs of varying size. Panel A: Accuracy of the distance-based clustering against the ground-truth classes, measured by the normalized mutual information (NMI), as a function of the embedding dimension $d$ used in the embedding step shown in Figure \ref{fig:pipeline}. The clustering task consists of recognizing four classes of synthetic temporal graphs with an unsupervised algorithm based on the distance $d_{\rm u}$. The classes are obtained by generating a graph from one of four models (stochastic block model (SBM), configuration model (CM), Erdős-Rényi (ER), and geometric model (GM)) with constant degree equal to $4.8$; the temporal component is obtained by sampling the edge activity of an empirical graph, as detailed in the main text. Inset of panel A: Scatter plot of the two-dimensional UMAP reduction of the vector $\bm{\lambda}$ appearing in the definition of $d_{\rm u}$ given in Eq. \ref{eq:du}, with $d = 32$. Each point refers to a temporal graph; the color and marker style refer to the generative model of its static component, while the marker size is proportional to the number of nodes. Panel B: Two-dimensional UMAP embedding of $\bm{\lambda}$ for the temporal graphs obtained by selecting two classes and a day of interaction, for all possible $(c_1, c_2, \hbox{Day})$ triplets in the SocioPatterns datasets describing temporal graphs of human proximity in schools. Each point refers to a triplet, and the color and marker style are assigned according to the class labels: the primary school classes form a group of their own, and the other three groups (Other-Other, Bio-Other, Bio-Bio) belong to the high school datasets, where Bio denotes the biology classes and Other the remaining ones. In all cases, the temporal networks are aggregated over a time scale of $t_{\rm res} = 10$ minutes.
Figure 3: Detection of partial node relabeling. Normalized matched distance $d_{\rm m}/n$ between a temporal graph and itself, upon partial node relabeling, as a function of the fraction $\alpha$ of relabeled nodes. The plots refer to the four generative models used in Figure \ref{fig:clustering}A: stochastic block model (SBM), Erdős-Rényi (ER), configuration model (CM), and geometric model (GM), with average degree equal to $4.8$. For each graph, the temporal component is obtained by sampling the edge activity of an empirical graph, as detailed in the main text. The barely visible shaded band is the standard deviation of the distance across $25$ randomizations of the partial relabeling.
Figure 4: Distance-based graph clustering for ensembles of temporal graphs generated according to different randomization techniques. Left panel: results for the matched distance $d_{\rm m}$ of Eq. \ref{eq:dm}. Right panel: results for the unmatched distance $d_{\rm u}$ of Eq. \ref{eq:du}. Within each panel, each matrix corresponds to one of the $9$ SocioPatterns temporal graphs described in Table \ref{tab:SP}. The rows and columns of each matrix correspond to the same set of $6$ randomization techniques, described in the \ref{sec:supp} sections. Each input graph is represented as a sequence of temporal edges $(i,j,t)$ as per Definition \ref{def:dynamic}. The randomizations act on the temporal edges as follows. Random: preserves the number of temporal edges and randomizes the node and time indices; Random delta: preserves the number of temporal edges and the interaction duration distribution; Active snapshot: preserves the number of edges at each time step and the times at which each node is active, i.e., at which it has at least one neighbor; Time: preserves the aggregated weighted graph structure, i.e., the number of times each edge is active in the temporal graph; Sequence: preserves each snapshot's adjacency matrix and randomizes the order in which the snapshots appear; Weighted degree: preserves the total number of temporal edges involving each node. For each pair of randomizations, we infer the randomization method of each temporal graph via an unsupervised distance-based clustering algorithm, and we compare the inferred randomization method with the known true one. Each matrix entry reports (as a value and a color) the accuracy of the inferred labels (randomization methods), quantified as the normalized mutual information (NMI) between the inferred and true labels.
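
The Sequence randomization in the Figure 4 caption is simple enough to sketch directly on a list of temporal edges $(i,j,t)$. The code below is a hedged illustration, not the authors' implementation: the function names are hypothetical, and it assumes that only time stamps carrying at least one edge are permuted, since empty snapshots do not appear in an edge list.

```python
import random


def sequence_randomization(temporal_edges, seed=None):
    """Shuffle the order of the snapshots while keeping each snapshot's edge set.

    Assumption: only time stamps that appear in the edge list are permuted.
    """
    rng = random.Random(seed)
    times = sorted({t for _, _, t in temporal_edges})
    shuffled = times[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(times, shuffled))  # bijection: old time -> new time
    return sorted(((i, j, relabel[t]) for i, j, t in temporal_edges),
                  key=lambda e: e[2])


def snapshots(temporal_edges):
    """Group temporal edges into a dict mapping time -> set of (i, j) edges."""
    snaps = {}
    for i, j, t in temporal_edges:
        snaps.setdefault(t, set()).add((i, j))
    return snaps


edges = [(0, 1, 0), (1, 2, 0), (0, 2, 1), (2, 3, 2)]
shuffled_edges = sequence_randomization(edges, seed=0)
```

Because the time stamps are relabeled through a bijection, every snapshot keeps exactly the same edges and only its position in the sequence changes, which is the property the caption attributes to this randomization.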

Theorems & Definitions (4)

Definition 1: Temporal graph
Definition 2: Matched graphs
Definition 3: Degree-corrected stochastic block model
Definition 4: Random geometric model
