Detecting signal from science: The structure of research communities and prior knowledge improves prediction of genetic regulatory experiments

Alexander V. Belikov, Andrey Rzhetsky, James Evans

TL;DR

This work tackles the challenge of navigating prior knowledge and reproducibility in biomedical literature by introducing a Bayesian framework that integrates claims from GeneWays and Literome with large-scale LINCS L1000 data. It partitions gene regulatory interactions into neutral, negative, and positive classes using data-driven thresholds, and builds a rich set of interaction- and batch-level features to predict neutrality, positivity, and claim correctness. The study demonstrates that scientifically focused yet institutionally diverse activity enhances replicability and shows how policy choices that broaden research communities can improve overall predictive power and robustness. Collectively, the approach provides a scalable, data-driven way to decode bias, estimate replicability, and guide science funding toward more reliable discoveries.

Abstract

The explosive growth of scientists, scientific journals, articles and findings in recent years exponentially increases the difficulty scientists face in navigating prior knowledge. This challenge is exacerbated by uncertainty about the reproducibility of published findings. The availability of massive digital archives, machine reading and extraction tools on the one hand, and automated high-throughput experiments on the other, allow us to evaluate these challenges at scale and identify novel opportunities for accelerating scientific advance. Here we demonstrate a Bayesian calculus that enables the positive prediction of robust, replicable scientific claims with findings automatically extracted from published literature on gene interactions. We matched these findings, filtered by science, with unfiltered gene interactions measured by the massive LINCS L1000 high-throughput experiment to identify and counteract sources of bias. Our calculus is built on easily extracted publication meta-data regarding the position of a scientific claim within the web of prior knowledge, and its breadth of support across institutions, authors and communities, revealing that scientifically focused but socially and institutionally independent research activity is most likely to replicate. These findings recommend policies that go against the common practice of channeling biomedical research funding into centralized research consortia and institutes rather than dispersing it more broadly. Our results demonstrate that robust scientific findings hinge upon a delicate balance of shared focus and independence, and that this complex pattern can be computationally exploited to decode bias and predict the replicability of published findings. These insights provide guidance for scientists navigating the research literature and for science funders seeking to improve it.

Paper Structure

This paper contains 26 sections, 17 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Correlation of mean claim value $\mu_\alpha$ and interaction strength $\hat{\pi}^\alpha$ from LINCS L1000 as a function of threshold on minimum claim sequence length per interaction for GeneWays (left) and Literome (right).
  • Figure 2: Claim number density for GeneWays (top panel) and Literome (bottom panel), for all interactions (left) and for selected positive/negative interactions (right).
  • Figure 3: Distance between the neutral $\mathcal{C}_0$ and negative $\mathcal{C}_-$ interaction classes, $W (g_-, g_0, \theta_-)$ (solid green line), and number of claims in the negative class $\mathcal{C}_-$, $n_-(\theta_-)$ (dotted green line), as functions of $\theta_-$; distance between the neutral $\mathcal{C}_0$ and positive $\mathcal{C}_+$ interaction classes, $W (g_+, g_0, \theta_+)$ (solid blue line), and number of claims in the positive class $\mathcal{C}_+$, $n_+(\theta_+)$ (dotted blue line), as functions of $\theta_+$, for GeneWays (left) and Literome (right).
  • Figure 4: GeneWays. Left: distance between neutral $\mathcal{C}_0$ and negative $\mathcal{C}_-$ interactions $W(g_0, g_-, \theta_-, \theta_+)$; right: distance between neutral $\mathcal{C}_0$ and positive $\mathcal{C}_+$ interactions $W(g_0, g_+, \theta_-, \theta_+)$.
  • Figure 5: Pearson correlation heat map vector between $\pi^\alpha_0$ and interaction level features for GeneWays (top panel) and Literome (bottom panel).
  • ...and 9 more figures