$U$-statistics on bipartite exchangeable networks

Tâm Le Minh

TL;DR

This work develops a theory of quadruplet $U$-statistics on bipartite row-column exchangeable networks, proving a general weak convergence result and a central limit theorem in the dissociated case. Using backward martingale methods and the Aldous–Hoover representation, the paper shows that the limiting distribution is a mixture of Gaussians in general and Gaussian when the matrix is dissociated, with explicit variance structures. The theory is applied to Bipartite Expected Degree Distribution (BEDD) models, deriving identifiability by quadruplets and providing practical inference tools for row heterogeneity, network comparison, and motif frequencies, supported by simulations. The framework yields computationally tractable estimators based on simple matrix operations, points to extensions to larger subgraphs and graphon-based models, and highlights avenues for finite-sample refinements.

Abstract

Bipartite networks with exchangeable nodes can be represented by row-column exchangeable matrices. A quadruplet is a submatrix of size $2 \times 2$. A quadruplet $U$-statistic is the average of a function on a quadruplet over all the quadruplets of a matrix. We prove several asymptotic results for quadruplet $U$-statistics on row-column exchangeable matrices, including a weak convergence result in the general case and a central limit theorem when the matrix is also dissociated. These results are applied to statistical inference in network analysis. We suggest a method to perform parameter estimation, network comparison and motif counting for a particular family of row-column exchangeable network models: the bipartite expected degree distribution (BEDD) models. These applications are illustrated by simulations.
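
The definition of a quadruplet $U$-statistic given above can be illustrated with a minimal brute-force sketch. It assumes that the average runs over all unordered pairs of rows and unordered pairs of columns, and that the kernel is symmetric under swapping the two rows or the two columns. The function names and the example kernel `h_row_product` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def quadruplet_u_statistic(Y, h):
    """Average the kernel h over all 2x2 submatrices (quadruplets) of Y.

    Y is a list of rows (an m x n matrix with m, n >= 2).
    h takes a quadruplet ((a, b), (c, d)) and returns a number.
    Assumption: quadruplets are indexed by i1 < i2 and j1 < j2.
    """
    m, n = len(Y), len(Y[0])
    total, count = 0.0, 0
    for i1, i2 in combinations(range(m), 2):
        for j1, j2 in combinations(range(n), 2):
            quad = ((Y[i1][j1], Y[i1][j2]),
                    (Y[i2][j1], Y[i2][j2]))
            total += h(quad)
            count += 1
    return total / count


def h_row_product(quad):
    """Hypothetical kernel: product of the two entries of a row,
    averaged over the two rows of the quadruplet."""
    (a, b), (c, d) = quad
    return 0.5 * (a * b + c * d)


# Small illustrative adjacency matrix (rows and columns are the two node types).
Y = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]
u = quadruplet_u_statistic(Y, h_row_product)
```

The loop costs $O(m^2 n^2)$ kernel evaluations, so it only serves as a reference implementation for small matrices.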

Paper Structure

This paper contains 45 sections, 38 theorems, 156 equations, 6 figures, 1 table.

Key Result

Proposition 2.2

$m_N$ and $n_N$ satisfy:

Figures (6)

  • Figure 1: Estimation of $F_2$: Frequency of the confidence intervals that contain the true value of $F_2$ for different values of $N$ (on a logarithmic scale). For each $N \in \{8,16,32,64,128,256,512,1024,2048\}$, we simulate $K = 1000$ networks with $\lambda=1$, $F_2=3$, $G_2=2$. For each simulated network, we estimate $F_2$ with the estimator $\widehat{\theta}_N$ and at level $\alpha = 0.95$, we build the asymptotic confidence intervals from the weak convergence results: [vdt] built from \ref{eq:convergence_vdt} (true value of $V^\delta$) and [vd] built from \ref{eq:convergence_vd} (estimated value of $V^\delta$ by $\widehat{V}^\delta_N$). The horizontal dashed lines represent the confidence interval at level $0.95$ of the frequency $Z = X/K$, if $X$ follows the binomial distribution with parameters $K$ and $\alpha$.
  • Figure 2: Estimation of $F_2$: Distribution of $\widehat{\theta}_N$ for different values of $N$. For each $N \in \{8,16,32,64,128,256,512,1024,2048\}$, we simulate $K = 1000$ networks with $\lambda=1$, $F_2=3$, $G_2=2$. For each simulated network, we estimate $F_2$ with the estimator $\widehat{\theta}_N$. The empirical distributions (solid red lines) are interpolated using the density() function from the base R stats package. The dashed blue curves correspond to the normal densities with mean $F_2 = 3$ and variance $V^\delta/N$. Under each plot, we give the value of the Kolmogorov-Smirnov test statistic $D$ between the empirical distribution of $\widehat{\theta}_N$ and the normal distribution with mean $F_2 = 3$ and variance $V^\delta/N$: $D=\sup_x | F_{emp}(x) - F(x) |$, where $F_{emp}$ is the empirical c.d.f. of $\widehat{\theta}_N$ and $F$ is the c.d.f. of that normal distribution.
  • Figure 3: Comparison of $F_2$ for two networks: Power of the test $\mathcal{H}_0 : F^A_2 = F^B_2$ vs. $\mathcal{H}_1 : F^A_2 \neq F^B_2$ using the statistic $Z_N(Y^A, Y^B)$ defined by \ref{eq:test_statistic_comparison}. We set $\lambda^A = \lambda^B = 1$, $G_2^A = G_2^B = 2$, $c^A = c^B = 0.5$. The value of $F^A_2$ is fixed at $3$; only $N$ and $F_2^B$ vary. Several values of $F^B_2$ between $1$ and $5$ are considered. For each $N \in \{32,64,128,256,512,1024\}$ and each $F^B_2$, we generate $K = 200$ pairs of networks of the same size $N^A = N^B = N/2$ with respective $F_2$ values $F^A_2$ and $F^B_2$. For each pair of networks $(Y^A, Y^B)$, we compute $Z_N(Y^A, Y^B)$ and reject the hypothesis $\mathcal{H}_0$ if $Z_N(Y^A, Y^B) \not\in I(\alpha)$. The empirical power (solid lines) is the frequency with which $\mathcal{H}_0$ is rejected among the $K$ simulations. The theoretical power (dashed lines) is the function $\psi_N(\Theta^A, \Theta^B)$, which only depends on $F^B_2$ since the other parameters are constant, computed with equation \ref{eq:theoretical_power}.
  • Figure 4: Motif counted by $U^{h_7}_N$. The circles and the squares represent the two types of nodes of a bipartite network. Assuming that the circles correspond to the rows and the squares to the columns of the adjacency matrix, the submatrix associated with this subgraph is $Y_{(1,2;1,2)} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ (figure taken from \cite{ouadah2022motif}).
  • Figure 5: Motif counts: Frequency of the confidence intervals that contain the theoretical value $T(\Theta)$ for different values of $N$ (on a logarithmic scale). For each $N \in \{8,16,32,64,128,256,512,1024,2048\}$, we simulate $K = 1000$ networks with $F_2=2$, $G_2=2$, $\lambda=0.9 \lambda_M$. For each simulated network, we determine the motif frequency with the estimator $U^{h_7}_N$ and at level $\alpha = 0.95$, we build the asymptotic confidence intervals from the weak convergence results: [vt] built from \ref{eq:convergence_motif_th} (true value of $V^{h_7}$) and [v] built from \ref{eq:convergence_motif} (estimated value of $V^{h_7}$ by $\widehat{V}_N$). The horizontal dashed lines represent the confidence interval at level $0.95$ of the frequency $Z = X/K$, if $X$ follows the binomial distribution with parameters $K$ and $\alpha$.
  • ...and 1 more figure
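
Two diagnostics described in the captions above can be sketched in a few lines: the Kolmogorov-Smirnov distance $D=\sup_x | F_{emp}(x) - F(x) |$ to a normal distribution, and the band within which an empirical coverage frequency $Z = X/K$ should fall when $X$ is binomial with parameters $K$ and $\alpha$. This is only an illustration under stated assumptions: the captions do not say how the binomial interval was computed, so the normal approximation used in `coverage_band` is an assumption, and all function names are hypothetical.

```python
import math


def normal_cdf(x, mean, var):
    """C.d.f. of the normal distribution with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def ks_statistic(sample, mean, var):
    """Kolmogorov-Smirnov distance between the empirical c.d.f. of
    `sample` and the normal c.d.f. with the given mean and variance.

    The supremum is attained at a sample point, just before or just
    after the jump of the empirical c.d.f., so both sides are checked.
    """
    xs = sorted(sample)
    k = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, var)
        d = max(d, (i + 1) / k - f, f - i / k)
    return d


def coverage_band(K, alpha, z=1.959963984540054):
    """Approximate 0.95-level band for Z = X / K with X ~ Binomial(K, alpha).

    Assumption: normal approximation to the binomial; z is the 0.975
    standard normal quantile.
    """
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / K)
    return alpha - half_width, alpha + half_width


# With K = 1000 simulations at nominal level 0.95, as in the captions.
lower, upper = coverage_band(1000, 0.95)
```

An empirical coverage frequency outside `(lower, upper)` suggests that the asymptotic confidence intervals are not yet accurate at that value of $N$.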

Theorems & Definitions (82)

  • Definition 2.1: Sequence of dimensions
  • Proposition 2.2
  • Corollary 2.3
  • Definition 2.4
  • Theorem 2.5: Main theorem
  • Definition 2.6
  • Theorem 2.7
  • Theorem 2.8
  • Remark
  • Theorem 2.9
  • ...and 72 more