Network-based Neighborhood regression

Yaoming Zhen, Jin-Hong Du

TL;DR

This work introduces a network-based neighborhood regression (NBNR) framework that jointly leverages local neighborhood structure and global community information to model directional regulation between gene modules. By enforcing a block structure on the regression coefficients, the authors decompose the estimation into independent community-wise least squares problems (CLSE) with closed-form solutions and establish non-asymptotic, concentration-based guarantees, including unbiasedness and minimax optimality. The analysis shows a striking linear-in-$n$ consistency when the network is sufficiently dense and well-structured, illustrating the advantage of incorporating neighborhood information over classical root-$n$ convergence. The method is validated through simulations and applied to Autism Spectrum Disorder data, where it uncovers interpretable inter- and intra-community regulatory patterns and achieves superior predictive performance relative to network-naive approaches, while providing a novel adjusted $R^2$ measure for model fit in a networked, modular setting.

Abstract

Given the ubiquity of modularity in biological systems, module-level regulation analysis is vital for understanding biological systems across various levels and their dynamics. Current statistical analysis on biological modules predominantly focuses on either detecting the functional modules in biological networks or sub-group regression on the biological features without using the network data. This paper proposes a novel network-based neighborhood regression framework whose regression functions depend on both the global community-level information and local connectivity structures among entities. An efficient community-wise least squares optimization approach is developed to uncover the strength of regulation among the network modules while enabling asymptotic inference. With random graph theory, we derive non-asymptotic estimation error bounds for the proposed estimator, achieving exact minimax optimality. Unlike the root-$n$ consistency typical in canonical linear regression, our model exhibits linear consistency in the number of nodes $n$, highlighting the advantage of incorporating neighborhood information. The effectiveness of the proposed framework is further supported by extensive numerical experiments. Application to whole-exome sequencing and RNA-sequencing Autism datasets demonstrates the use of the proposed method in identifying the association between the gene modules of genetic variations and the gene modules of genomic differential expressions.

Paper Structure (41 sections, 13 theorems, 153 equations, 5 figures, 1 table)


Key Result

Proposition 1

For any $k\in[K]$, the stationary point of $\mathcal{L}_k(\bm{\beta}_{k, \cdot})$, at which $\partial \mathcal{L}_k(\bm{\beta}_{k, \cdot})/ \partial\bm{\beta}_{k, \cdot} = \bm{0}$, is the solution to the system of normal equations $\bm{M}_k^\top\bm{M}_k\bm{\beta}_{k,\cdot} = \bm{M}_k^{\top}\bm{y}$. In particular, when $\bm{M}_k$ has rank $K$, it follows that $\widehat{\bm{\beta}}_{k,\cdot} = (\bm{M}_k^\top\bm{M}_k)^{-1}\bm{M}_k^{\top}\bm{y}$.
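
The closed form in Proposition 1 is an ordinary least squares solution computed separately for each community. The NumPy sketch below illustrates that computation on simulated data. It is only an illustration under stated assumptions: the helper names `community_design` and `clse`, the way the design matrix is built (neighborhood sums of the covariate per community, scaled by $n$), and the restriction of the response to the nodes of community $k$ are hypothetical choices made here, not definitions taken from the paper.

```python
import numpy as np

def community_design(A, x, labels, k, K):
    """Hypothetical design matrix for community k (an assumption for this sketch).

    Row i (a node in community k) stores, for each community l, the sum of the
    covariate x over the neighbors of i that belong to l, divided by n.
    """
    n = A.shape[0]
    rows = np.flatnonzero(labels == k)
    M = np.zeros((rows.size, K))
    for l in range(K):
        members = np.flatnonzero(labels == l)
        M[:, l] = A[rows][:, members] @ x[members] / n
    return M, rows

def clse(M_k, y_k):
    """Solve the normal equations M_k^T M_k beta = M_k^T y_k (needs rank K)."""
    K = M_k.shape[1]
    if np.linalg.matrix_rank(M_k) < K:
        raise ValueError("M_k must have rank K for the closed form to exist")
    return np.linalg.solve(M_k.T @ M_k, M_k.T @ y_k)

# Simulated network with K balanced communities (all values are illustrative).
rng = np.random.default_rng(0)
n, K = 300, 3
labels = np.arange(n) % K
P = 0.1 + 0.3 * np.eye(K)  # denser within communities than between them
upper = np.triu(rng.random((n, n)) < P[labels][:, labels], 1)
A = (upper | upper.T).astype(float)  # symmetric adjacency, no self-loops
x = rng.normal(1.0, 1.0, size=n)
beta_true = rng.normal(size=(K, K))

y = np.zeros(n)
beta_hat = np.zeros((K, K))
beta_ref = np.zeros((K, K))
for k in range(K):
    M_k, rows = community_design(A, x, labels, k, K)
    y[rows] = M_k @ beta_true[k] + 0.01 * rng.normal(size=rows.size)
    beta_hat[k] = clse(M_k, y[rows])  # closed form via normal equations
    beta_ref[k] = np.linalg.lstsq(M_k, y[rows], rcond=None)[0]  # cross-check
```

Each row of `beta_hat` is estimated independently of the others, which is what makes the problem separable across communities. In practice `np.linalg.lstsq` is usually the numerically safer choice than forming $\bm{M}_k^\top\bm{M}_k$ explicitly; the normal-equation route is shown only because it mirrors the formula in the proposition.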

Figures (5)

  • Figure 1: Network-based neighborhood regression. (a) The covariate $\bm{x}$ and response $\bm{y}$ are observed for each node in a network $\bm{A}$. (b) The community-wise interactions are modeled by the coefficient matrix $\bm{\beta}$. (c) The conditional mean of a particular response is the average of the covariates in its neighborhood weighted by the community-wise interaction strengths.
  • Figure 2: Estimation and prediction errors, in log scale, for the experiments of \ref{subsec:simu-net} with varying numbers of communities. The shaded regions represent the standard errors around the average values computed over 200 simulated datasets. For netcoh, only the prediction errors are available and presented.
  • Figure 3: Estimation and prediction errors, in log scale, for the experiments of \ref{subsec:simu-coef} with different coefficient structures. The shaded regions represent the standard errors around the average values computed over 200 simulated datasets.
  • Figure 4: Visualization of ASD data and detected communities. (a) The histograms of GR one-sided z-scores ($\bm{x}$) and DE two-sided z-scores ($\bm{y}$) for 864 substantial autism genes. (b) The scree plot of the singular values of $\bm{A}$. (c) The gene co-expression network colored by z-values and grouped by estimated communities. (d) The adjacency matrix $\bm{A}$ colored by connectivity (white for 1 and black for 0) and ordered by estimated assignments.
  • Figure 5: Top gene ontology terms for genes in communities 3, 4, and 6.

Theorems & Definitions (14)

  • Proposition 1: Stationary point
  • Corollary 2: Asymptotic normality
  • Theorem 3: Fisher information concentration
  • Corollary 4
  • Remark 5: Matrix quadratic form
  • Theorem 6: Unbiasedness and Consistency
  • Corollary 7
  • Theorem 8: Community membership misspecification
  • Theorem 9: Community-wise best linear unbiased estimator
  • Theorem 10: Minimax optimality
  • ...and 4 more