Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery

Yifan Wu, Yuntao Yang, Zirui Liu, Zhao Li, Khushbu Pahwa, Rongbin Li, Wenjin Zheng, Xia Hu, Zhaozhuo Xu

TL;DR

This work uses a Transformer model as a data-driven tool for discovering gene-gene interactions in single-cell data, and introduces a weighted diversified sampling algorithm that generates a small subset of the dataset for efficient interaction discovery.

Abstract

Gene-gene interactions play a crucial role in the manifestation of complex human diseases. Uncovering significant gene-gene interactions is a challenging task. Here, we present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth noteworthy gene-gene interactions. Despite the efficacy of Transformer models, their parameter intensity presents a bottleneck in data ingestion, hindering data efficiency. To mitigate this, we introduce a novel weighted diversified sampling algorithm. This algorithm computes the diversity score of each data sample in just two passes of the dataset, facilitating efficient subset generation for interaction discovery. Our extensive experimentation demonstrates that by sampling a mere 1% of the single-cell dataset, we achieve performance comparable to that of utilizing the entire dataset.

Paper Structure

This paper contains 24 sections, 2 theorems, 7 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 3.4

Given a cell dataset $X$, for every $q\in X$, we compute $w_q$ following Algorithm 1. Then we have $\mathbb{E}[w_q] = \sum_{x\in X} (\mathsf{Min}\text{-}\mathsf{Max}(x,q)+o(1))$, where $\mathsf{Min}\text{-}\mathsf{Max}$ is the $\mathsf{Min}\text{-}\mathsf{Max}$ similarity defined
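
To make the quantity in Theorem 3.4 concrete, the following is a minimal sketch of the exact sum $\sum_{x\in X} \mathsf{Min}\text{-}\mathsf{Max}(x,q)$ that $w_q$ estimates in expectation. It assumes that the paper's Definition 3.1 is the standard Min-Max similarity for non-negative vectors, $\sum_i \min(x_i, q_i) / \sum_i \max(x_i, q_i)$; the definition itself is not reproduced on this page. The function names and the toy data are hypothetical, and this brute-force version needs a full pairwise comparison, so it is a reference point only and not the paper's two-pass hashing algorithm.

```python
import numpy as np


def min_max_similarity(x, q):
    """Standard Min-Max similarity of two non-negative vectors (assumed definition)."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = np.maximum(x, q).sum()
    # Convention assumed here: two all-zero vectors get similarity 0.
    return float(np.minimum(x, q).sum() / denom) if denom > 0 else 0.0


def min_max_density(X, q):
    """Brute-force sum of Min-Max similarities between q and every row of X."""
    return sum(min_max_similarity(x, q) for x in X)


# Toy "cells": rows are non-negative expression vectors (made-up values).
X = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],  # duplicate of the first cell
    [0.0, 0.0, 3.0],  # shares no expressed gene with the others
])

densities = [min_max_density(X, q) for q in X]
print(densities)  # [2.0, 2.0, 1.0]
```

In this toy example the two duplicated cells each have density 2, while the isolated cell has density 1 (only its similarity to itself). A lower density marks a cell that is less redundant with the rest of the dataset, which is the intuition behind weighting samples by an inverse density.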

Figures (3)

  • Figure 1: Distribution of Sequence Lengths in L6_CT Cell Type Data.
  • Figure 2: Gene-gene interaction modeling with attention maps.
  • Figure 3: Accumulating multiple cells' average attention maps.

Theorems & Definitions (8)

  • Definition 3.1: $\mathsf{Min}\text{-}\mathsf{Max}$ Similarity
  • Definition 3.2: $\mathsf{Min}\text{-}\mathsf{Max}$ Density
  • Definition 3.3: 0-bit Consistent Weighted Sampling Hash Functions [li20150, li2021consistent]
  • Theorem 3.4: $\mathsf{Min}\text{-}\mathsf{Max}$ Density Estimator, informal version of Theorem B.1
  • Definition 3.5: Inverse $\mathsf{Min}\text{-}\mathsf{Max}$ Density (IMD)
  • Definition 3.6: Estimated Interaction Score with WDS
  • Theorem B.1: $\mathsf{Min}\text{-}\mathsf{Max}$ Density Estimator, formal version of Theorem 3.4
  • proof