A Bayesian approach to out-of-sample network reconstruction

Mattia Marzi, Tiziano Squartini

TL;DR

A Bayesian approach is developed that uses the information about past network snapshots to inform a prior and predict the subsequent ones, while quantifying uncertainty, enabling self-sustained, out-of-sample reconstruction of evolving networks with a minimal amount of additional data.

Abstract

Networks underpin systems that range from finance to biology, yet their structure is often only partially observed. Current reconstruction methods typically fit the parameters of a model anew to each snapshot, thus offering no guidance for predicting future configurations. Here, we develop a Bayesian approach that uses the information about past network snapshots to inform a prior and predict the subsequent ones, while quantifying uncertainty. Instantiated with a single-parameter fitness model, our method infers link probabilities from node strengths and carries information forward in time. When applied to the Electronic Market for Interbank Deposit across the years 1999-2012, our method accurately recovers the number of connections per bank at subsequent times, outperforming probabilistic benchmarks designed for analogous link-prediction tasks. Notably, each predicted snapshot serves as a reliable prior for the next one, thus enabling self-sustained, out-of-sample reconstruction of evolving networks with a minimal amount of additional data.
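As a rough illustration of how a single-parameter fitness model can turn node strengths into link probabilities, the sketch below assumes the functional form $p_{ij}=zs_is_j/(1+zs_is_j)$, which is common in the network reconstruction literature; the abstract does not state the exact form used here. The function names and the bisection-based calibration of $z$ against a target number of links are hypothetical, and the sketch covers only the mapping from strengths to probabilities, not the Bayesian update of the prior.

```python
import math

def link_probabilities(strengths, z):
    """Assumed fitness-model form: p_ij = z*s_i*s_j / (1 + z*s_i*s_j), no self-loops."""
    n = len(strengths)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            x = z * strengths[i] * strengths[j]
            p[i][j] = p[j][i] = x / (1.0 + x)
    return p

def expected_links(strengths, z):
    """Expected number of links: sum of p_ij over unordered node pairs."""
    p = link_probabilities(strengths, z)
    n = len(strengths)
    return sum(p[i][j] for i in range(n) for j in range(i + 1, n))

def calibrate_z(strengths, target_links, lo=1e-12, hi=1e12, iters=200):
    """Hypothetical helper: bisection on a log scale for the z that matches target_links.

    Works because the expected number of links increases monotonically with z.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if expected_links(strengths, mid) < target_links:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def expected_degrees(p):
    """Expected degree of each node: row sums of the probability matrix."""
    return [sum(row) for row in p]

# Toy example with made-up strengths (not data from the paper).
strengths = [5.0, 3.0, 2.0, 1.0]
z = calibrate_z(strengths, target_links=3.0)
p = link_probabilities(strengths, z)
degrees = expected_degrees(p)
```

In this toy setting, nodes with larger strength receive larger expected degrees, which is the qualitative behaviour a fitness model is meant to capture.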

Paper Structure (29 sections, 76 equations, 13 figures)


Figures (13)

  • Figure 1: Top panels: empirical values of the total number of links (red) and node degrees (blue) scattered versus the predicted ones, pooled across the weeks constituting our dataset; the dashed line marks the identity. Middle panels: evolution of the relative error on the total number of links (red) and the average relative error on the node degrees (blue), across the weeks constituting our dataset. Bottom panels: evolution of the $\langle\text{TPR}\rangle$, the $\langle\text{PPV}\rangle$, the $\langle\text{TNR}\rangle$ and the $\langle\text{ACC}\rangle$ across the weeks constituting our dataset. The results concerning the BERM are shown on the left while those concerning the BFM are shown on the right: while both models recover the total number of links and achieve a large $\langle\text{ACC}\rangle$ score, driven by the large value of the $\langle\text{TNR}\rangle$, only the BFM recovers the degree sequence with acceptable accuracy, while also more than doubling the other scores.
  • Figure 2: Left panel: evolution of the TPR, the JI and the AUROC across the weeks constituting our dataset. Right panel: ROC curves for all snapshots. The purple one represents the average ROC, obtained by interpolating each snapshot-specific ROC on a common grid of FPR values and averaging the corresponding TPR values. These ranking-based diagnostics are meaningful only for the BFM, which induces a non-trivial ordering of candidate links.
  • Figure 3: Left panel: values of the Kullback-Leibler divergence between $\mathbf{A}$ and its ensemble average $\mathbf{Q}$ scattered versus the values of the Kullback-Leibler divergence between $\mathbf{A}$ and its 'self-sustained' inferred version $\mathbf{R}$, pooled across the weeks constituting our dataset. Right panel: values of the total number of links (red) and node degrees (blue) predicted by employing $\mathbf{Q}$ scattered versus the values predicted by employing $\mathbf{R}$, pooled across the weeks constituting our dataset. Both plots confirm that $\mathbf{Q}$ represents a reliable surrogate of $\mathbf{A}$: in fact, it is accurate enough to constitute a valid prior for subsequent inference.
  • Figure 4: Left panel: empirical adjacency matrix $\mathbf{A}_{t+1}$ corresponding to the week $\#20$ of the year $2007$. Middle panel: ensemble average of $\mathbf{A}_{t+1}$, i.e. $\mathbf{Q}_{t+1}$. Right panel: 'self-sustained', inferred version of $\mathbf{A}_{t+1}$, i.e. $\mathbf{R}_{t+1}$. While $\mathbf{Q}_{t+1}$ needs the information provided by $\mathbf{A}_t$, $\mathbf{R}_{t+1}$ 'only' needs the information provided by $\mathbf{Q}_t$, i.e. an estimate of $\mathbf{A}_t$. More quantitatively, $2\sum_{i=1}^N\sum_{j(>i)}|q_{ij}-r_{ij}|/N(N-1)\simeq0.006$ and $2\sum_{i=1}^N\sum_{j(>i)}|a_{ij}-q_{ij}|/N(N-1)\simeq0.152\simeq2\sum_{i=1}^N\sum_{j(>i)}|a_{ij}-r_{ij}|/N(N-1)$.
  • Figure 5: Metric-specific distribution of the improvement of the 'self-sustained' Bayesian predictor with respect to the (in-sample) dcGM. For each snapshot and metric $m$, we define the score $I_m$ in two different ways, i.e. as $I_m=(m_{\text{dcGM}}-m_{\text{Bayes}})/|m_{\text{dcGM}}|$ for the $\text{ARE}_k$ and the $\text{MRE}_k$ and as $I_m=(m_{\text{Bayes}}-m_{\text{dcGM}})/|m_{\text{dcGM}}|$ for the $\langle\text{TPR}\rangle$, the $\langle\text{PPV}\rangle$, the $\langle\text{TNR}\rangle$ and the $\langle\text{ACC}\rangle$: in both cases, values above the $0\%$ dashed line indicate that the Bayesian predictor performs better than the (in-sample) dcGM. Each violin plot summarizes the distribution of the improvement, showing that our fully predictive procedure frequently matches, and sometimes exceeds, the in-sample reconstruction calibrated by taking $L$ as input at each time step.
  • ...and 8 more figures
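
The quantities named in the captions above can be sketched in a few lines. The mean absolute difference follows the expression in the Figure 4 caption and the improvement score follows the two definitions in the Figure 5 caption. The expected confusion-matrix scores are an assumption: the captions do not define them, so the sketch uses the usual ensemble-average forms, e.g. $\langle\text{TPR}\rangle=\sum_{i<j}a_{ij}p_{ij}/L$. All function names and the toy matrices are hypothetical.

```python
def upper_pairs(n):
    """All unordered node pairs (i, j) with i < j."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def mean_abs_difference(x, y):
    """2 * sum_{i<j} |x_ij - y_ij| / (N (N - 1)), as in the Figure 4 caption."""
    n = len(x)
    total = sum(abs(x[i][j] - y[i][j]) for i, j in upper_pairs(n))
    return 2.0 * total / (n * (n - 1))

def expected_scores(a, p):
    """Assumed ensemble-average definitions of TPR, PPV, TNR and ACC.

    a: binary adjacency matrix; p: matrix of link probabilities.
    """
    pairs = upper_pairs(len(a))
    links = sum(a[i][j] for i, j in pairs)                   # observed links L
    tp = sum(a[i][j] * p[i][j] for i, j in pairs)            # expected true positives
    tn = sum((1 - a[i][j]) * (1 - p[i][j]) for i, j in pairs)  # expected true negatives
    predicted = sum(p[i][j] for i, j in pairs)               # expected number of links
    return {
        "TPR": tp / links,
        "PPV": tp / predicted,
        "TNR": tn / (len(pairs) - links),
        "ACC": (tp + tn) / len(pairs),
    }

def improvement(m_bayes, m_dcgm, lower_is_better):
    """Score I_m from the Figure 5 caption; positive means the Bayesian predictor wins."""
    if lower_is_better:  # error-type metrics such as ARE_k and MRE_k
        return (m_dcgm - m_bayes) / abs(m_dcgm)
    return (m_bayes - m_dcgm) / abs(m_dcgm)  # score-type metrics such as TPR, PPV, TNR, ACC

# Toy 3-node example (made-up numbers, not results from the paper).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Q = [[0.0, 0.8, 0.1], [0.8, 0.0, 0.6], [0.1, 0.6, 0.0]]
scores = expected_scores(A, Q)
gap = mean_abs_difference(A, Q)
```

Splitting the improvement score by metric direction keeps the sign convention uniform: a positive value always favours the Bayesian predictor, whether the underlying metric is an error or a score.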