The Certainty Bound: Structural Limits on Scientific Reliability

Marco Pollanen

Abstract

Explanations of the replication crisis often emphasize misconduct, questionable research practices, or incentive misalignment, implying that behavioral reform is sufficient. This paper argues that a substantial component is architectural: within binary significance-based publication systems, even perfectly diligent researchers face structural limits on the reliability they can deliver. The posterior log-odds of a finding equal prior log-odds plus log(Lambda), where Lambda = (1-beta)/alpha is the experimental leverage. Interpreted architecturally, this implies a hard constraint: once evidence is coarsened to a binary significance decision, the decision rule contributes exactly log(Lambda) to posterior log-odds. A target reliability tau is feasible iff pi >= pi_crit, and under fixed alpha this generally cannot be rescued by sample size alone. Two mechanisms can drive effective leverage to 1 without bad faith: persistent unmeasured confounding in observational studies and unbounded specification search under publication pressure. These results concern binary significance-based decision architectures and do not bound inference based on full likelihoods or richer continuous evidence summaries. Two collapse results formalize these mechanisms, while the Replication Pipeline Theorem and Minimum Pipeline Depth Corollary identify a quantitative evidentiary standard for escape. Using independently documented parameters for pre-reform psychology (pi about 0.10, power about 0.35), the framework implies a replication rate of 36%, consistent with the Open Science Collaboration. The framework also provides quantitative bridges to Popper, Kuhn, and Lakatos. In low-prior settings below the single-study feasibility threshold, the natural unit of evidence is the replication pipeline rather than the individual experiment.
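The log-odds identity stated in the abstract can be checked numerically. The sketch below is illustrative only. It assumes the conventional threshold alpha = 0.05, which the abstract does not state for the pre-reform psychology calibration, and all function names are hypothetical. The depth calculation further assumes that each additional independent significant result adds log(Lambda) with the same leverage at every stage; this is a simplification and not necessarily the exact form of the paper's Minimum Pipeline Depth Corollary.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Probability corresponding to log-odds x."""
    return 1.0 / (1.0 + math.exp(-x))


def leverage(power, alpha):
    """Experimental leverage Lambda = (1 - beta) / alpha, with power = 1 - beta."""
    return power / alpha


def posterior_after_significant(prior, power, alpha, k=1):
    """Posterior probability after k significant results.

    ASSUMPTION: the k results are independent and share the same leverage,
    so each contributes log(Lambda) to the posterior log-odds.
    """
    return inv_logit(logit(prior) + k * math.log(leverage(power, alpha)))


def min_depth(prior, power, alpha, tau):
    """Smallest k with posterior >= tau under the same independence assumption."""
    lam = leverage(power, alpha)
    if lam <= 1.0:
        raise ValueError("leverage must exceed 1 for evidence to accumulate")
    gap = logit(tau) - logit(prior)
    return max(0, math.ceil(gap / math.log(lam)))


# Calibration quoted in the abstract: pi about 0.10, power about 0.35.
# alpha = 0.05 is an assumed value, not taken from the abstract.
print(posterior_after_significant(0.10, 0.35, 0.05))
print(min_depth(0.10, 0.35, 0.05, tau=0.95))
```

Under these assumptions, a single significant result starting from a prior of 0.10 gives a posterior of about 0.44, and three consecutive independent significant results are needed to exceed 0.95.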

Paper Structure (55 sections, 23 theorems, 22 equations, 2 figures, 4 tables)

Key Result

Theorem 7

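For any statistical test with leverage $\Lambda$ applied to hypotheses with prior $\pi \in (0,1)$:

$$\mathrm{PPV} = \frac{\pi\Lambda}{\pi\Lambda + (1-\pi)}.$$

Equivalently, in log-odds form:

$$\log\frac{\mathrm{PPV}}{1-\mathrm{PPV}} = \log\frac{\pi}{1-\pi} + \log\Lambda.$$

Achieving $\mathrm{PPV} \geq \tau$ requires

$$\Lambda \geq \frac{\tau}{1-\tau}\cdot\frac{1-\pi}{\pi}.$$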
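The Certainty Bound can be written out directly from Bayes' rule with $\Lambda = (1-\beta)/\alpha$ as defined in the abstract. The following is a minimal sketch under that reading; the function names are hypothetical and do not come from the paper.

```python
def ppv(prior, lam):
    """Positive predictive value for prior pi and leverage Lambda (Bayes' rule)."""
    return prior * lam / (prior * lam + (1.0 - prior))


def required_leverage(prior, tau):
    """Smallest leverage Lambda for which PPV >= tau at the given prior."""
    return (tau / (1.0 - tau)) * ((1.0 - prior) / prior)


def critical_prior(lam, tau):
    """Smallest prior pi_crit for which PPV >= tau at the given leverage."""
    return tau / (tau + (1.0 - tau) * lam)


# Illustrative values: prior 0.10 and target reliability 0.95.
print(required_leverage(0.10, 0.95))
# Sanity check: at the critical prior the PPV equals the target.
print(ppv(critical_prior(16.0, 0.95), 16.0))
```

For a prior of 0.10 and a target of 0.95, the required leverage is about 171. Because power cannot exceed 1, leverage at $\alpha = 0.05$ is at most $1/\alpha = 20$, so no sample size reaches the target in that setting.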

Figures (2)

  • Figure 1: PPV as a function of prior probability under two significance thresholds. The darkly shaded region ($\pi < 0.059$) is the majority-false regime at conventional parameters. The lightly shaded region ($0.059 < \pi < 0.543$) is infeasible for target $\tau = 0.95$ at $\alpha = 0.05$. Tightening $\alpha$ to 0.005 shifts the critical prior to 0.106. The chord between $\pi = 0.02$ and $\pi = 0.18$ lies below the curve, visualizing the heterogeneity tax.
  • Figure 2: Reliability landscape for scientific fields. Points represent illustrative calibrations, not precise measurements; their purpose is regime visualization. The dashed line is the feasibility boundary for $\tau = 0.95$ ($\Psi = 1$): fields above and to the right are feasible. The dotted line is the majority-false boundary. Blue: majority-false exemplars. Red: improved but still infeasible. Green: feasible.
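The thresholds quoted in the Figure 1 caption can be reproduced with a few lines of arithmetic. The sketch below assumes power of 0.80, which the caption does not state; that value is chosen here because it is consistent with the quoted numbers. The function names are hypothetical.

```python
def ppv(prior, lam):
    """Positive predictive value for prior pi and leverage Lambda."""
    return prior * lam / (prior * lam + (1.0 - prior))


def critical_prior(lam, tau):
    """Smallest prior for which PPV >= tau at leverage Lambda."""
    return tau / (tau + (1.0 - tau) * lam)


POWER = 0.80  # ASSUMPTION: not given in the caption

lam_05 = POWER / 0.05    # leverage at alpha = 0.05
lam_005 = POWER / 0.005  # leverage at alpha = 0.005

majority_false = critical_prior(lam_05, 0.5)    # boundary where PPV = 0.5
infeasible_05 = critical_prior(lam_05, 0.95)    # tau = 0.95 at alpha = 0.05
infeasible_005 = critical_prior(lam_005, 0.95)  # tau = 0.95 at alpha = 0.005

# Chord versus curve between pi = 0.02 and pi = 0.18 at alpha = 0.05:
# the chord midpoint is the average of the endpoint PPVs, while the
# curve value is the PPV at the midpoint prior pi = 0.10.
chord_mid = 0.5 * (ppv(0.02, lam_05) + ppv(0.18, lam_05))
curve_mid = ppv(0.10, lam_05)
gap = curve_mid - chord_mid

print(round(majority_false, 3), round(infeasible_05, 3), round(infeasible_005, 3))
print(round(chord_mid, 3), round(curve_mid, 3), round(gap, 3))
```

With power 0.80 the three critical priors round to 0.059, 0.543, and 0.106, matching the caption. The chord midpoint (about 0.512) falls below the curve value at $\pi = 0.10$ (0.64) because PPV is concave in the prior.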

Theorems & Definitions (58)

  • Definition 1: State Space and Prior
  • Definition 2: Test Outcomes and Error Rates
  • Remark 3: Nominal versus Effective Error Rates
  • Definition 4: Experimental Leverage
  • Remark 5: Minimal Discrimination and Sign Reversal
  • Definition 6: Positive Predictive Value
  • Theorem 7: Certainty Bound
  • proof
  • Example 8: Required Leverage
  • Theorem 9: Fixed-$\alpha$ Ceiling
  • ...and 48 more