Network reconstruction via the minimum description length principle

Tiago P. Peixoto

TL;DR

This work introduces a minimum description length (MDL)–based, nonparametric regularization for network reconstruction that avoids weight shrinkage and cross-validation. By quantizing weights and integrating a latent sparse structure with an adaptive, discrete weight distribution, the method naturally selects model complexity to compress data efficiently. The resulting framework yields sparser, more accurate reconstructions than $L_{1}$ with cross-validation and scales to very large networks, especially when combined with a subquadratic inference algorithm and optional SBM priors. Empirical case studies on microbial interaction networks demonstrate modular structure and predictive insights for interventions and tipping points, highlighting the practical impact of MDL regularization for complex systems.

Abstract

A fundamental problem associated with the task of network reconstruction from dynamical or behavioral data consists in determining the most appropriate model complexity in a manner that prevents overfitting and produces an inferred network with a statistically justifiable number of edges. The status quo in this context is based on $L_{1}$ regularization combined with cross-validation. However, besides its high computational cost, this commonplace approach unnecessarily ties the promotion of sparsity to weight "shrinkage". This combination forces a trade-off between the bias introduced by shrinkage and the network sparsity, which often results in substantial overfitting even after cross-validation. In this work, we propose an alternative nonparametric regularization scheme based on hierarchical Bayesian inference and weight quantization, which does not rely on weight shrinkage to promote sparsity. Our approach follows the minimum description length (MDL) principle, and uncovers the weight distribution that allows for the most compression of the data, thus avoiding overfitting without requiring cross-validation. The latter property renders our approach substantially faster to employ, as it requires a single fit to the complete data. As a result, we have a principled and efficient inference scheme that can be used with a large variety of generative models, without requiring the number of edges to be known in advance. We also demonstrate that our scheme yields systematically increased accuracy in the reconstruction of both artificial and empirical networks. We highlight the use of our method with the reconstruction of interaction networks between microbial communities from large-scale abundance samples involving on the order of $10^{4}$ to $10^{5}$ species, and demonstrate how the inferred model can be used to predict the outcome of interventions in the system.
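
The coupling between sparsity and shrinkage described in the abstract can be illustrated in the simplest possible setting. The sketch below is not the paper's model: it assumes a squared-error loss with one independent coefficient per entry, in which case the $L_{1}$-penalized estimate is the soft-thresholding of the unpenalized estimate. The function name `soft_threshold` and the numbers in `z` are hypothetical, chosen only for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise minimizer of 0.5 * (w - z)**2 + lam * |w|."""
    z = np.asarray(z, dtype=float)
    # Entries with |z| <= lam are set to zero; all others move toward zero by lam.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalized estimates: three "true" edges followed by three noise entries.
z = np.array([0.22, -0.22, 0.25, 0.05, -0.03, 0.08])

for lam in (0.02, 0.1):
    w = soft_threshold(z, lam)
    print(f"lambda={lam}: weights={w}, nonzero={np.count_nonzero(w)}")
```

In this toy setting, a penalty large enough to zero out the noise entries also reduces the magnitude of every surviving weight by the same amount, which is the trade-off between sparsity and bias that the abstract refers to.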

Paper Structure

This paper contains 14 sections, 37 equations, and 9 figures.

Figures (9)

  • Figure 1: $L_{1}$ regularization overfits when combined with cross-validation. This example considers the reconstruction of the weighted Karate club network \cite{zachary_information_1977} ($N=34$ nodes and $E=78$ edges, weight values sampled i.i.d. from a normal distribution with mean $0.22$ and standard deviation $0.01$), based on $M=1,000$ transitions from the kinetic Ising model with a random initial state, using Eqs. \ref{eq:MAP} and \ref{eq:laplace}, for a range of values of the regularization strength $\lambda$. Panel (a) shows, from top to bottom, the Jaccard similarity between inferred and true weights and binarized edges, the mean held-out likelihood for a 5-fold cross-validation, the number of inferred nonzero edges, and the individual values of inferred weights (with true nonzero edges shown in blue, and true zero-valued entries shown in red). The grey horizontal lines at the right margin of the bottom of panel (a) show the true weight values. The vertical dashed lines mark the values of $\lambda$ that maximize (b) the mean held-out likelihood and (c) the binarized Jaccard similarity. The inferred networks for these two values of $\lambda$ are shown in panels (b) and (c), respectively, with edge weights represented as thickness and the color indicating whether an edge is true (blue) or spurious (red). In panel (a), all dashed horizontal lines, as well as the red line at the right margin of the bottom panel, mark the results obtained with the MDL regularization of Sec. \ref{sec:mdl}.
  • Figure 2: Inference results for the kinetic and equilibrium Ising model for two empirical networks, as indicated in the legend, with true weights sampled as described in the text, using the true prior, our MDL regularization, and $L_{1}$ regularization with 5-fold cross-validation. The individual panels show the Jaccard similarity $s(\bm{W},\hat{\bm{W}})$ between the inferred and true networks, as well as the number $E$ of inferred nonzero edges (the dashed horizontal line shows the true value).
  • Figure 3: Reconstructed networks of interactions between $N=623$ members of the lower house of the Brazilian congress \cite{peixoto_network_2019}, during the 2007 to 2011 term, corresponding to $M=619$ voting sessions. Panel (a) shows the network inferred with MDL regularization, with edge weights corresponding to their thickness, and the node colors indicating the division found with the SBM incorporated into the regularization. Panel (b) shows the result with $L_{1}$ regularization together with 5-fold cross-validation, with negative weights shown in red. The colors indicate the group assignments found by fitting an SBM to the resulting network. Panel (c) shows the weight distributions obtained with both methods, and (d) the mean held-out likelihood of a 5-fold cross-validation with each method, including also the MDL version without the SBM.
  • Figure 4: Decimation procedure \cite{decelle_pseudolikelihood_2014} employed on the same data as Fig. \ref{fig:camara}. The bottom panel shows the maximum likelihood growing monotonically as a function of the number of nonzero edges considered during decimation, and the top panel shows the similarity (weighted and binarized) with the network inferred via MDL. The vertical dashed line corresponds to the value $E=24,196$, obtained using the stopping criterion proposed in Ref. \cite{decelle_pseudolikelihood_2014}.
  • Figure 5: Reconstructed networks for the (a) human microbiome project (HMP) with $N=45,383$, $M=4,788$, and $E=122,804$ and (b) earth microbiome project (EMP) with $N=126,730$, $M=23,323$, and $E=735,868$, using the MDL regularization method, together with an SBM prior. The top panels show the networks obtained, with edge weights indicated as colors. The middle panel shows the OTU counts in (a) different body sites and (b) earth habitats (the color code is the same as the "total count" panels in Fig. \ref{fig:taxonomy_emp}). The bottom panel shows the edge weight distributions (with the same colors as the top panel), the node field distributions, the degree distributions (where node $i$ has total degree $k_{i}=\sum_{j}A_{ij}$, as well as positive and negative degrees, $k_{i}^{+}=\sum_{j}A_{ij}\mathds{1}_{W_{ij}>0}$ and $k_{i}^{-}=\sum_{j}A_{ij}\mathds{1}_{W_{ij}<0}$, respectively), the node strength distributions (where node $i$ has total strength $d_{i}=\sum_{j}W_{ij}$, as well as positive and negative strengths, $d_{i}^{+}=\sum_{j}W_{ij}\mathds{1}_{W_{ij}>0}$ and $d_{i}^{-}=\sum_{j}W_{ij}\mathds{1}_{W_{ij}<0}$, respectively), and the average node fields as a function of the degrees (with the expected "magnetization" in the inset).
  • ...and 4 more figures
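
The degree and strength definitions given in the Figure 5 caption can be computed directly from a weight matrix. The following is a minimal sketch, assuming that the adjacency matrix is obtained as $A_{ij}=1$ whenever $W_{ij}\neq 0$ (an assumption made here for illustration; the caption does not state how $A$ is stored). The function name `degrees_and_strengths` and the small matrix `W` are hypothetical.

```python
import numpy as np

def degrees_and_strengths(W):
    """Per-node degrees and strengths, following the Figure 5 caption.

    Assumes A_ij = 1 wherever W_ij != 0 (illustrative assumption).
    """
    W = np.asarray(W, dtype=float)
    A = (W != 0).astype(int)   # binarized adjacency matrix
    pos = W > 0                # indicator of positive weights
    neg = W < 0                # indicator of negative weights
    return {
        "k": A.sum(axis=1),              # k_i   = sum_j A_ij
        "k_pos": (A * pos).sum(axis=1),  # k_i^+ = sum_j A_ij 1[W_ij > 0]
        "k_neg": (A * neg).sum(axis=1),  # k_i^- = sum_j A_ij 1[W_ij < 0]
        "d": W.sum(axis=1),              # d_i   = sum_j W_ij
        "d_pos": (W * pos).sum(axis=1),  # d_i^+ = sum_j W_ij 1[W_ij > 0]
        "d_neg": (W * neg).sum(axis=1),  # d_i^- = sum_j W_ij 1[W_ij < 0]
    }

# Hypothetical symmetric weight matrix with one positive and one negative edge.
W = np.array([
    [0.0, 0.5, -0.2],
    [0.5, 0.0, 0.0],
    [-0.2, 0.0, 0.0],
])
stats = degrees_and_strengths(W)
for name, values in stats.items():
    print(name, values)
```

By construction, each node satisfies $k_{i}=k_{i}^{+}+k_{i}^{-}$ and $d_{i}=d_{i}^{+}+d_{i}^{-}$, which gives a quick consistency check on any reconstructed network.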