Compression-based inference of network motif sets

Alexis Bénichou; Jean-Baptiste Masson; Christian L. Vestergaard

TL;DR

This work develops a framework for motif mining based on lossless network compression using subgraph contractions. It provides an alternative definition of motif significance that makes it possible to compare different motifs and to select the collectively most significant set of motifs, as well as other prominent network features, in terms of their combined compression of the network.

Abstract

Physical and functional constraints on biological networks lead to complex topological patterns across multiple scales in their organization. A particular type of higher-order network feature that has received considerable interest is network motifs, defined as statistically regular subgraphs. These may implement fundamental logical and computational circuits and are referred to as "building blocks of complex networks". Their well-defined structures and small sizes also enable the testing of their functions in synthetic and natural biological experiments. The statistical inference of network motifs is, however, fraught with difficulties, from defining and sampling the right null model to accounting for the large number of possible motifs and their potential correlations in statistical testing. Here we develop a framework for motif mining based on lossless network compression using subgraph contractions. The minimum description length principle allows us to select the most significant set of motifs, as well as other prominent network features, in terms of their combined compression of the network. The approach inherently accounts for multiple testing and correlations between subgraphs and does not rely on a priori specification of an appropriate null model. This provides an alternative definition of motif significance which guarantees more robust statistical inference. Our approach overcomes the common problems in classic testing-based motif analysis. We apply our methodology to perform comparative connectomics by evaluating the compressibility and the circuit motifs of a range of synaptic-resolution neural connectomes.

Paper Structure (19 sections, 96 equations, 15 figures, 5 tables, 4 algorithms)

Subgraph contraction.
Multigraphs.
Simple graphs.
Model complexity.
Multigraphs.
Simple graphs.
Model complexity.
Multigraphs.
Simple graphs.
Model complexity.
Multigraphs.
Simple graphs.
Model complexity.
Uniform code.
Plug-in code.
...and 4 more sections

Figures (15)

Figure 1: Graphlet-based graph compression. (A) Reduced representation of a graph $G$ obtained by contracting subgraphs into colored supernodes representing the subgraphs. (In this example, two different graphlets, colored in blue and green, are selected.) The cost for encoding the reduced representation can be split into two parts: (i) encoding the multigraph $H$ obtained by contracting subgraphs in $G$, $L(H, \phi)$ (see the "\ref{methods:base-codes}" section), and (ii) encoding which nodes in $H$ are supernodes and their color, designating which graphlet they represent, $L(\mathcal{V}|H,\mathcal{S})$ (Eq. \ref{eq:supernodeCost}). (B) Hierarchy of the four different dyadic graph models \cite{gauvin_randomized_2022} used as base codes. Each node in the diagram represents a model. An edge between two nodes indicates that the upper model is less random than the lower. The models are: the Erdős-Rényi model $P_{(N,E)}$ (cyan); the directed configuration model $P_{(\mathbf{k}^+,\mathbf{k}^-)}$ (orange); the reciprocal Erdős-Rényi model $P_{(N,E_m,E_d)}$ (pink); and the reciprocal configuration model $P_{(\bm{\kappa}^m,\bm{\kappa}^+,\bm{\kappa}^-)}$ (yellow). (C-E) Encoding the additional information necessary for lossless reconstruction of $G$ from $H$ incurs a cost $L(G|H, \mathcal{V}, \mathcal{S}, \Gamma)$ (Eq. \ref{eq:reconstructionCost}) that is equal to the sum of three terms for each supernode, corresponding to encoding the labels of the nodes inside the graphlet, i.e., the graphlet's orientation (C), and how the graphlet's nodes are wired to other nodes in $H$ (D,E). (C) Encoding the orientation of a graphlet is equivalent to specifying its automorphism class. For the graphlet shown in the example there are 3 possible distinguishable orientations, leading to a codelength of $\log 3$. (D) Encoding the connections between a simple node and a supernode involves designating to which nodes in the graphlet the in- and out-going edges to the supernode are connected. In this example, there are $\binom{4}{2}$ possible wiring configurations for both the in- and out-going edges, leading to a wiring cost of $\log 36$ (see Eq. \ref{eq:rewiringCost}). (E) Encoding the wiring configuration of the edges from a supernode $i$ to another supernode $j$ involves designating the edges from the group of nodes of supernode $i$ to the group of nodes in $j$ in the bipartite graph composed of the two groups (the edges from $j$ to $i$ are accounted for in the encoding of $j$). Here, there are $\binom{20}{1}$ such configurations, leading to a rewiring cost of $\log 20$ bits.
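The three codelengths quoted in panels C-E can be checked with a few lines of arithmetic. The sketch below is only an illustration of the caption's numbers, not the authors' code: the helper names `orientation_cost` and `wiring_cost` are hypothetical, logarithms are assumed to be base 2 (the caption reports the last cost in bits), and panel D is read as a 4-node graphlet with 2 incoming and 2 outgoing edges whose endpoints are chosen independently.

```python
from math import comb, log2


def orientation_cost(n_orientations: int) -> float:
    """Bits needed to single out one of n equally likely orientations."""
    return log2(n_orientations)


def wiring_cost(n_slots: int, n_edges: int) -> float:
    """Bits needed to say which n_edges of n_slots possible slots are occupied."""
    return log2(comb(n_slots, n_edges))


# Panel C: three distinguishable orientations.
cost_c = orientation_cost(3)

# Panel D (assumed reading): a 4-node graphlet with 2 in-edges and 2 out-edges
# to one simple node; binom(4, 2) = 6 choices per direction, 6 * 6 = 36 in total.
cost_d = wiring_cost(4, 2) + wiring_cost(4, 2)

# Panel E: one edge among 20 possible node pairs between two supernodes.
cost_e = wiring_cost(20, 1)

print(f"C: {cost_c:.3f} bits, D: {cost_d:.3f} bits, E: {cost_e:.3f} bits")
```

Summing the two per-direction costs in panel D is the same as taking the logarithm of the product of the configuration counts, which is why the caption can report a single $\log 36$.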
Figure 1: Distribution of graph polynomial root (GPR) values of all 3- to 5-node graphlets. The minimum value of the GPR is 1/5 for five-node graphlets. It would be 0 in an infinite, maximally asymmetric graph, e.g., one where the automorphism group is a singleton. A GPR of 1, i.e., its maximum value for any graph size, represents maximally symmetric graphs, i.e., cliques or empty graphs. The symmetry of inferred motif sets in Fig 5 in the "Results" section should be interpreted knowing that the GPR is bounded between 0.2 and 1.
Figure 2: Greedy optimization algorithm. (A) Illustration of a single step of the greedy stochastic algorithm. The putative compression $\Delta L(G,\theta,s)$ that would be obtained by contracting each of the subgraphs in the minibatch is calculated, and the subgraph contraction resulting in the highest compression is selected (highlighted in blue). (B) Example of a motif set inferred in the connectome of the right hemisphere of the mushroom bodies (MB right) of the Drosophila larva. (C) Evolution of the codelength during a single algorithm run. The algorithm is continued until no more subgraphs can be contracted. The representation $\theta^* = \theta_t$ with the shortest codelength is selected; here, after the 31st iteration (indicated by a vertical black dashed line). The horizontal orange dashed line indicates the codelength of the corresponding simple graph model without motifs (see \ref{methods:null_models}). (D) The algorithm is run a hundred times for each dyadic base model and the most compressing model $\hat{\theta}$ is selected. Histograms represent the codelengths of models with motifs after each run of the greedy algorithm; colors correspond to the different base models (blue: ER model, orange: configuration model, pink: reciprocal ER model, yellow: reciprocal configuration model, see Fig \ref{fig:model}B and Table \ref{tab:codelengths}); vertical dashed lines represent the codelengths of models without motifs, and the black dashed line indicates the codelength of the shortest-codelength model, here the configuration model with motifs.
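The selection logic described in panels A and C can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `greedy_step`, `best_iteration`, and the `gain` callback are hypothetical names, the minibatch is assumed to be drawn uniformly at random, and the compression values and codelength trace are made up for the example.

```python
import random
from typing import Callable, Hashable, Sequence, Tuple


def greedy_step(
    candidates: Sequence[Hashable],
    gain: Callable[[Hashable], float],
    batch_size: int,
    rng: random.Random,
) -> Tuple[Hashable, float]:
    """Draw a minibatch of candidate subgraphs and return the one whose
    contraction would give the highest putative compression."""
    batch = rng.sample(list(candidates), min(batch_size, len(candidates)))
    best = max(batch, key=gain)
    return best, gain(best)


def best_iteration(codelengths: Sequence[float]) -> int:
    """Index of the iteration with the shortest total codelength."""
    return min(range(len(codelengths)), key=codelengths.__getitem__)


# Made-up compression values (in bits) for three candidate subgraphs.
toy_gain = {"s1": 1.0, "s2": 3.5, "s3": -0.2}
chosen, delta = greedy_step(list(toy_gain), toy_gain.__getitem__, 3, random.Random(0))

# Made-up codelength trace: the run continues past the minimum, and the
# shortest-codelength representation is picked afterwards.
trace = [100.0, 90.0, 85.0, 88.0, 95.0]
t_star = best_iteration(trace)
```

Keeping the whole trace and selecting the minimum afterwards mirrors the caption's point that the run continues until no contraction is possible, even once the codelength has started to increase again.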
Figure 3: Performance of compression-based motif inference on numerically generated networks. (A-D) Number of spurious motifs inferred using our compression-based method with MDL-based model selection and using hypothesis testing with four different null models in random networks generated from the same four null models: (A) the Erdős-Rényi model (ER); (B) the configuration model (CM); (C) the reciprocal ER model (RER); and (D) the reciprocal CM (RCM). The x-axis labels indicate which method was used for motif inference: our method (MDL) or classic hypothesis testing with each of the four null models as reference. The corresponding generative model is highlighted in boldface. To make hypothesis testing as conservative as possible, we applied a Bonferroni correction, which multiplies the raw $p$-values by $|\Gamma| = 9\,576$, and we set the uncorrected significance threshold to $0.01$. The random networks in (A-D) are all generated by fixing the values of each null model's parameters to those of the Drosophila larva right MB connectome (e.g., $N=198$ and $E=6\,499$ for the ER model). (E-H) Ability of our method to correctly identify a placed graphlet as a motif as a function of the number of times it is repeated, $m_{\alpha}$. We show results for two selected 5-node graphlets: an hourglass structure (top row) and a clique (bottom row). The clique is the densest graphlet and is totally symmetric (the number of orientations, i.e., the number of non-automorphic node permutations, is equal to one). The hourglass has intermediary density, $\rho_\alpha = 2/5$, and symmetry, with 60 non-automorphic orientations within a possible range of 1 to $5! = 120$. The generated networks in (E-H) contain $N = 300$ nodes and an edge density of either $\rho = E/N(N-1) = 0.025$ (E,G) or $\rho=0.1$ (F,H). Each point is an average over five independently generated graphs. (E,F) The discovery rate is the estimated probability that the planted motif belongs to the inferred motif set, i.e., $\langle 1-\delta(m_\alpha,0) \rangle$. (G,H) Average inferred number of repetitions of the planted motif, $\langle m_\alpha\rangle$.
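Two quantities in this caption are easy to reproduce: the number of orientations of a graphlet, $n!/|\mathrm{Aut}|$, and the Bonferroni-corrected $p$-value. The sketch below is a brute-force illustration, not the authors' code. The function names are hypothetical, and the directed 5-cycle is an extra example chosen here because its symmetry is easy to verify by hand. The hourglass graphlet is not reproduced because its edge list is not given in the caption.

```python
from itertools import permutations
from math import factorial
from typing import Iterable, Tuple


def count_orientations(n: int, edges: Iterable[Tuple[int, int]]) -> int:
    """Number of distinguishable node labelings of a directed graphlet,
    n! / |Aut|, with automorphisms counted by brute force over permutations."""
    edge_set = frozenset(edges)
    n_automorphisms = sum(
        1
        for perm in permutations(range(n))
        if frozenset((perm[u], perm[v]) for u, v in edge_set) == edge_set
    )
    return factorial(n) // n_automorphisms


def bonferroni(p_raw: float, n_tests: int) -> float:
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_raw * n_tests)


# Fully connected directed 5-node graphlet: every permutation is an automorphism.
clique = [(u, v) for u in range(5) for v in range(5) if u != v]

# Directed 5-cycle (illustrative only): only the 5 rotations preserve it.
cycle = [(i, (i + 1) % 5) for i in range(5)]

n_clique = count_orientations(5, clique)  # 120 / 120
n_cycle = count_orientations(5, cycle)    # 120 / 5

# With 9576 candidate graphlets and a threshold of 0.01 on corrected p-values,
# a raw p-value must be below 0.01 / 9576 to be called significant.
is_significant = bonferroni(1e-7, 9576) < 0.01
```

The brute-force automorphism count is only practical for the small graphlets considered here (at most $5! = 120$ permutations); larger subgraphs would call for a dedicated graph-isomorphism routine.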
Figure 4: Compressibility of neural connectomes. Compressibility (measured in number of bits per edge in the network) $\Delta L^*/E$ of different connectomes as compared to encoding the edges independently using the Erdős-Rényi simple graph model (see Table \ref{tab:codelengths}). Two types of models are shown for the datasets: the best simple network encoding and the best motif-based encoding when this compresses more than the simple encoding. Asterisks highlight connectomes where motifs permit a higher compression than the reference models. (A) Whole-CNS and whole-animal connectomes. (B) Connectomes of three different regions of the adult Drosophila right hemibrain. Note that while the relative increase in compressibility of these connectomes obtained using motifs is relatively small, the motifs are highly significant due to the large size of these connectomes (Table \ref{tab:datasets}). (C) Connectomes of different brain regions of first instar Drosophila larva. (D) Connectomes of C. elegans head ganglia at different developmental stages, from 0 hours to 50 hours (adult). While no higher-order motifs are found, the compressibility increases with maturation (and thus the size) of the connectome.
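The per-edge compressibility can be sketched as a difference of codelengths divided by the number of edges. The block below rests on several assumptions that are not confirmed by the caption: the ER reference is taken to be a uniform code over simple directed graphs with $N$ nodes and $E$ edges (no self-loops), the cost of encoding $N$ and $E$ themselves is ignored, logarithms are base 2, and a positive value means the model compresses better than the reference. All function names are hypothetical. $N$ and $E$ are the right MB values quoted in the Figure 3 caption.

```python
from math import lgamma, log


def log2_binomial(n: int, k: int) -> float:
    """log2 of the binomial coefficient C(n, k), computed via log-gamma
    so that it stays finite for large n."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)


def er_codelength(n_nodes: int, n_edges: int) -> float:
    """Bits for a uniform code over simple directed graphs with the given
    numbers of nodes and edges (assumed form of the ER reference)."""
    return log2_binomial(n_nodes * (n_nodes - 1), n_edges)


def compressibility_per_edge(l_reference: float, l_model: float, n_edges: int) -> float:
    """Bits saved per edge by a model relative to the reference code."""
    return (l_reference - l_model) / n_edges


# Right MB connectome of the Drosophila larva, as quoted in Figure 3.
N, E = 198, 6499
l_er = er_codelength(N, E)
print(f"ER reference codelength: {l_er:.1f} bits, {l_er / E:.2f} bits per edge")
```

Normalizing by $E$ is what makes connectomes of very different sizes comparable on the same axis.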
...and 10 more figures
