The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control

Jiajun Liang, Yue Liu, Doudou Zhou, Sinian Zhang, Junwei Lu

TL;DR

A novel inferential framework for general high-dimensional graphical models that selects graph features with false discovery rate control. Its structural screening method is applied to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors.

Abstract

Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc. While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially graph features related to topological structures such as cycles (i.e., wreaths), cliques, and hubs. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill the gap, we propose a novel inferential framework for general high-dimensional graphical models to select graph features with false discovery rate control. Our method is based on the maximum of $p$-values from the single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states, and identify the residues that are crucial for protein conformational changes and are thus potential targets for inhibition.
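
The idea of scoring a topological feature by the maximum of its edge $p$-values can be sketched as follows. This is a minimal illustration, not the paper's procedure: the edge $p$-values are made-up numbers, the helper names `feature_p_value` and `benjamini_hochberg` are hypothetical, and the standard Benjamini-Hochberg step-up rule is used only as a stand-in for the FDR-controlling step described in the abstract.

```python
import numpy as np

def feature_p_value(edge_p, feature_edges):
    """p-value of a feature: the largest p-value among its edges.

    The feature is significant only if every one of its edges is.
    """
    return max(edge_p[e] for e in feature_edges)

def benjamini_hochberg(p_values, q=0.1):
    """Standard BH step-up rule (a stand-in, not the paper's method).

    Returns a boolean array marking the rejected hypotheses.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of sorted p-values
    thresholds = q * np.arange(1, m + 1) / m   # q * k / m for k = 1..m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()        # largest k passing its threshold
        reject[order[:k + 1]] = True
    return reject

# Illustrative (made-up) edge p-values on a 4-node graph.
edge_p = {
    (0, 1): 0.001, (1, 2): 0.004, (0, 2): 0.002,
    (2, 3): 0.30, (3, 0): 0.01, (1, 3): 0.6,
}
# Candidate features: three triangles, each given by its edge list.
features = {
    "triangle_012": [(0, 1), (1, 2), (0, 2)],
    "triangle_023": [(0, 2), (2, 3), (3, 0)],
    "triangle_013": [(0, 1), (1, 3), (3, 0)],
}

names = list(features)
feature_p = [feature_p_value(edge_p, features[n]) for n in names]
rejected = benjamini_hochberg(feature_p, q=0.1)
selected = [n for n, r in zip(names, rejected) if r]
print(selected)  # ['triangle_012']
```

Features that share edges have dependent $p$-values, so plain Benjamini-Hochberg carries no guarantee here. It only shows where a multiple-testing step would sit.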

Paper Structure

This paper contains 41 sections, 18 theorems, 148 equations, 10 figures, 2 tables, 4 algorithms.

Key Result

Proposition 4.0

For a GGM with $\Theta \in {\cal U}(s)$, where the parameter space ${\cal U}(s)$ is defined in eqn:us, suppose $s^2\log^4(dn)/n=o(1)$. Then the estimators $\widehat{W}_e$ and $\widehat{\sigma}_{e}^2$ defined in eq:thetad, with $W_{uv}$ as in eq:That:GGM2, satisfy Assumption assum:That:condition.

Figures (10)

  • Figure 1: Illustration of the filtered graph, homological features, and the persistent barcode. Two $1$-dimensional homological features (blue, i.e., $\mathrm{rank}(Z_1(E(\mu_2))) = 2$) show up at $\mu_2$ and a $2$-dimensional homological feature (yellow, i.e., $\mathrm{rank}(Z_2(E(\mu_3))) = 1$) appears at $\mu_3$.
  • Figure 2: (Left) Illustration of the method for selecting homological features in persistent homology. At each iteration $t$, the edges shown on the graph form the filtered edge set $E^{(t)}$, the weight of the red edge(s) gives the next filtration level $\mu^{(t+1)}$, the yellow cycles are the generators that remain in the next iteration, and the orange cycle(s) are the generators that disappear in the next iteration. The blue horizontal lines represent the lifetime of each cycle along the graph filtration. (Right) 3D illustration of the graph filtration as the filtration level $\mu$ increases.
  • Figure 3: Connectivity and correlation analysis among residues across different domains. Panel A highlights the strong connectivity of residues from the RBD (the yellow nodes) with those from other domains. Panel B shows the correlation analysis, emphasizing the strong associations between residues within the RBD and those in the NTD, the linker domain, and a portion of the S2 domain. Panel C summarizes the top 10 residues identified at each stage based on their importance scores.
  • Figure 4: Illustration of the conformational changes and disruption of hydrogen bonds in residues K386/S683 and A570/I569 across three stages.
  • Figure 5: Visualization of the location of residues within the protein structure, emphasizing their proximity to the RBD. Color coding: Chain A in red, Chain B in blue, Chain C in yellow. Domains are distinguished as follows: NTD in light blue, RBD in green, the linker domain in grey, and the S2 domain in khaki.
  • ...and 5 more figures
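
The quantity $\mathrm{rank}(Z_1(E(\mu)))$ in the Figure 1 caption is, for a graph, the dimension of its cycle space, which equals $|E| - |V| + c$ with $c$ the number of connected components. The sketch below tracks this rank along a filtration. It rests on assumptions not stated on this page: the filtered edge set is taken to be $E(\mu) = \{e : |W_e| > \mu\}$, the edge weights are made up, and the function names are hypothetical.

```python
def cycle_rank(num_nodes, edges):
    """Dimension of the cycle space of a simple graph: |E| - |V| + #components."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_nodes
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return len(edges) - num_nodes + components

def cycle_rank_along_filtration(num_nodes, weighted_edges, levels):
    """Cycle rank of the thresholded graph at each filtration level.

    Assumption: an edge is kept at level mu when |weight| > mu.
    """
    ranks = {}
    for mu in levels:
        kept = [(u, v) for u, v, w in weighted_edges if abs(w) > mu]
        ranks[mu] = cycle_rank(num_nodes, kept)
    return ranks

# Illustrative (made-up) weighted edges on 4 nodes: (u, v, weight).
weighted_edges = [
    (0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7),
    (2, 3, 0.5), (3, 0, 0.4), (1, 3, 0.2),
]
ranks = cycle_rank_along_filtration(4, weighted_edges, [0.1, 0.45, 0.75])
print(ranks)  # {0.1: 3, 0.45: 1, 0.75: 0}
```

Under this thresholding convention, independent cycles disappear as $\mu$ grows, which matches the disappearing generators described in the Figure 2 caption.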

Theorems & Definitions (20)

  • Example 2.1: Gaussian Graphical Model (GGM)
  • Example 2.1: Ferromagnetic Ising Model
  • Proposition 4.0
  • Proposition 4.0
  • Proposition 4.2
  • Proposition 4.3
  • Theorem 4.4: General FDR control
  • Theorem 4.5: Power Analysis
  • Corollary 4.5: Graph feature selection under GGM
  • Corollary 4.5: Graph feature selection under Ising models
  • ...and 10 more