Quantifying Ranking Instability Across Evaluation Protocol Axes in Gene Regulatory Network Benchmarking

Ihor Kendiukhov

TL;DR

This work presents a systematic diagnostic framework for measuring ranking instability under protocol shift, including decomposition tools that separate base-rate effects from discrimination effects. It also proposes concrete reporting practices for stability-aware evaluation and provides a diagnostic toolkit for identifying method pairs at risk of reversal.

Abstract

Benchmark rankings are routinely used to justify scientific claims about method quality in gene regulatory network (GRN) inference, yet the stability of these rankings under plausible evaluation-protocol choices is rarely examined. We present a systematic diagnostic framework for measuring ranking instability under protocol shift, including decomposition tools that separate base-rate effects from discrimination effects. Using existing single-cell GRN benchmark outputs across three human tissues and six inference methods, we quantify pairwise reversal rates across four protocol axes: candidate-set restriction (16.3%, 95% CI 11.0% to 23.4%), tissue context (19.3%), reference-network choice (32.1%), and symbol-mapping policy (0.0%). A permutation null confirms that observed reversal rates are far below random-order expectations (0.163 versus a null mean of 0.500), indicating a partially stable but non-invariant ranking structure. Our decomposition reveals that reversals are driven by changes in the relative discrimination ability of methods rather than by base-rate inflation, a finding that challenges a common implicit assumption in GRN benchmarking. We propose concrete reporting practices for stability-aware evaluation and provide a diagnostic toolkit for identifying method pairs at risk of reversal.
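To make the central quantity concrete, the following is a minimal sketch of a pairwise reversal rate between two evaluation protocols. It assumes each protocol yields one scalar score per method and that a pair counts as reversed when the sign of its score margin flips. The function name, the tie handling (ties are not counted as reversals), and the example scores are illustrative assumptions, not the paper's implementation or data.

```python
from itertools import combinations


def pairwise_reversal_rate(scores_a, scores_b):
    """Fraction of method pairs whose relative order flips between two protocols.

    scores_a, scores_b: dicts mapping method name -> score under each protocol.
    Only methods present under both protocols are compared.
    Assumption: a pair tied under either protocol is not counted as a reversal.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(methods, 2))
    if not pairs:
        raise ValueError("need at least two methods shared by both protocols")

    reversals = 0
    for m1, m2 in pairs:
        margin_a = scores_a[m1] - scores_a[m2]
        margin_b = scores_b[m1] - scores_b[m2]
        # A negative product means the two margins have opposite signs.
        if margin_a * margin_b < 0:
            reversals += 1
    return reversals / len(pairs)


# Hypothetical scores for three methods under two protocols.
protocol_a = {"m1": 0.3, "m2": 0.2, "m3": 0.1}
protocol_b = {"m1": 0.3, "m2": 0.1, "m3": 0.2}

rate = pairwise_reversal_rate(protocol_a, protocol_b)
print(f"{rate:.3f}")  # prints 0.333: only the (m2, m3) pair flips
```

With six methods there are 15 pairs per comparison, so a reported rate would aggregate such counts over every protocol contrast along an axis.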

Paper Structure (38 sections, 5 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Pairwise reversal rates under candidate-set shifts, stratified by tissue and shift type. Each cell shows the fraction of method pairs whose ranking reverses. Immune evaluations exhibit the highest sensitivity to candidate-set restriction.
  • Figure 2: Decomposition of margin shifts into base-rate and discrimination components. Reversal cases (red) show that discrimination changes, rather than base-rate inflation, drive rank flips.
  • Figure 3: Pairwise reversal rates under tissue shifts, stratified by candidate-set type. More constrained candidate spaces amplify cross-tissue ranking instability.
  • Figure 4: Pairwise reversal rates under reference-network shifts in immune baseline evaluations. Different reference networks encode different biological evidence classes, producing high ranking instability.
  • Figure 5: Observed candidate-shift reversal rate versus permutation null distribution (5,000 permutations). The observed rate is far below random-order expectations, indicating substantial shared ranking structure that coexists with nontrivial instability.
  • ...and 1 more figure
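The permutation null referenced in Figure 5 can be illustrated with a short sketch. The permutation scheme is not specified in this summary, so the version below makes an assumption: scores under the second protocol are shuffled across methods, and the reversal rate is recomputed for each shuffle. All names and example scores are hypothetical.

```python
import random
from itertools import combinations


def reversal_rate(a, b):
    """Fraction of index pairs whose score margin changes sign between a and b."""
    pairs = list(combinations(range(len(a)), 2))
    flips = sum((a[i] - a[j]) * (b[i] - b[j]) < 0 for i, j in pairs)
    return flips / len(pairs)


def permutation_null(a, b, n_perm=5000, seed=0):
    """Null distribution of reversal rates when method labels in b are shuffled.

    Assumption: shuffling b across methods removes any shared ranking structure.
    """
    rng = random.Random(seed)
    shuffled = list(b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(reversal_rate(a, shuffled))
    return null


# Hypothetical scores for six methods under two protocols.
scores_a = [0.31, 0.27, 0.22, 0.18, 0.12, 0.09]
scores_b = [0.29, 0.30, 0.20, 0.15, 0.16, 0.05]

observed = reversal_rate(scores_a, scores_b)
null = permutation_null(scores_a, scores_b)
null_mean = sum(null) / len(null)
# One-sided p-value with the usual +1 correction: how often a random ordering
# yields a reversal rate at least as low as the observed one.
p_low = (1 + sum(r <= observed for r in null)) / (1 + len(null))
print(f"observed={observed:.3f}  null mean={null_mean:.3f}  p={p_low:.4f}")
```

When all scores are distinct, each pair flips with probability one half under random ordering, which is why the null mean sits near 0.500. An observed rate well below that indicates shared ranking structure, even if the rate itself is not zero.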