Hierarchical biomarker thresholding: a model-agnostic framework for stability

O. Debeaupuis

TL;DR

The paper introduces a model-agnostic framework for stable hierarchical biomarker thresholding that yields an external-risk certificate at the realized operating point $\hat{t}$. It decomposes external risk $R_Q(\hat{t})$ into internal fit, patient-level generalization, operating-point shift, and instability, and links this decomposition to a bootstrap-based stability penalty for threshold selection. The approach enables quantile-scale ensembling and selection-honest evaluation with actionable diagnostics, while providing a monotone-invariant aggregation across methods and sites. Empirical validation on CAMELYON pathology data and MIMIC-IV-ECG demonstrates reduced external risk and fewer decision flips compared with baselines, illustrating practical deployment benefits and interpretability. The framework offers a principled, interpretable, and transport-aware methodology for threshold-based decisions in hierarchical, domain-shifted biomedical settings.

Abstract

Many biomarker pipelines require patient-level decisions aggregated from instance-level (cell/patch) scores. Thresholds tuned on pooled instances often fail across sites due to hierarchical dependence, prevalence shift, and score-scale mismatch. We present a selection-honest framework for hierarchical thresholding that makes patient-level decisions reproducible and more defensible. At its core is a risk decomposition theorem for selection-honest thresholds. The theorem separates contributions from (i) internal fit and patient-level generalization, (ii) operating-point shift reflecting prevalence and shape changes, and (iii) a stability term that penalizes sensitivity to threshold perturbations. The stability component is computable via patient-block bootstraps mapped through a monotone modulus of risk. This framework is model-agnostic, reconciles heterogeneous decision rules on a quantile scale, and yields monotone-invariant ensembles and reportable diagnostics (e.g. flip-rate, operating-point shift).
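To make the quantile-scale idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes each method's threshold can be re-expressed as the fraction of a shared reference cohort scored below it; the helper names (`threshold_to_quantile`, `quantile_to_threshold`, `ensemble_quantile`) and the synthetic scores are hypothetical. Because a strictly increasing rescaling of the scores leaves that fraction unchanged, averaging on this scale is monotone-invariant.

```python
import numpy as np

def threshold_to_quantile(ref_scores, t):
    """Fraction of reference scores strictly below threshold t."""
    return float(np.mean(np.asarray(ref_scores) < t))

def quantile_to_threshold(ref_scores, q):
    """Score value at quantile level q of the reference scores."""
    return float(np.quantile(ref_scores, q))

def ensemble_quantile(ref_scores_by_method, thresholds_by_method):
    """Average the per-method operating points on the shared quantile scale."""
    qs = [threshold_to_quantile(s, t)
          for s, t in zip(ref_scores_by_method, thresholds_by_method)]
    return float(np.mean(qs))

# Synthetic patient-level scores (assumption): method B is a strictly
# increasing transform of method A, so the two live on different score scales.
rng = np.random.default_rng(0)
scores_a = rng.normal(size=500)
scores_b = np.exp(scores_a)

# The same decision rule expressed on each method's native scale.
t_a, t_b = 0.3, float(np.exp(0.3))
q_a = threshold_to_quantile(scores_a, t_a)
q_b = threshold_to_quantile(scores_b, t_b)  # matches q_a: monotone invariance

# Two different rules reconciled on the quantile scale, then mapped back
# to each method's native score scale.
q_ens = ensemble_quantile([scores_a, scores_b], [t_a, 0.8])
t_ens_a = quantile_to_threshold(scores_a, q_ens)
t_ens_b = quantile_to_threshold(scores_b, q_ens)
```

Averaging raw thresholds instead would depend on each method's score scale, which is the scale-mismatch failure the abstract describes.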

Paper Structure

This paper contains 35 sections, 2 theorems, 22 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.7

Under (H1)--(H3), for any selection-honest threshold $\hat{t}$ and any $\delta_{\text{val}}\in(0,1)$, with probability at least $1-\delta_{\text{val}}$ over the validation patients, […]. If, in addition, $\hat{t}$ is an (approximate) empirical minimizer of $\widehat{R}^{\text{val}}(\cdot)$ and $t^\ast\in\arg\min_{u}R_P(u)$, then […].

Figures (2)

  • Figure 1: Penalizing instability shifts threshold to a stable basin. (a) Internal risk. Approximate “true” internal risk (blue; large-sample proxy) and empirical validation risk (light blue) over thresholds. ERM selects $t^{\mathrm{ERM}}$ in a sharp basin; the robust method selects $t^{J}$ further right. (b) Instability map $\mathcal{G}_{\text{boot}}(t)$ (illustrative display). Computed from patient-level bootstrap risk curves by taking the pointwise standard deviation of the empirical risk across bootstrap replicates and multiplying by a curvature proxy $\kappa(t)$, defined as the normalized second finite difference of the bootstrap mean risk curve; the resulting signal is smoothed with a moving average and scaled to $[0,1]$ for display (see Appendix). This $\mathcal{G}_{\text{boot}}$ term is exactly the instability component used in (c). (c) Penalized objective $J(t)=\hat{R}^{\text{val}}(t)+\lambda\,\mathcal{G}_{\text{boot}}(t)$; the instability lifts the sharp basin, shifting the minimizer to $t^{J}$. The red curve corresponds to the $P$-derived upper bound. (d) Excess external risk $\Delta(t)=R_{Q}(t)-R_{P}(t)$ concentrates near sharp regions. (e) External risk $R_{Q}(t)$: the penalized threshold lowers external risk relative to ERM. (f) External risk comparison at selected thresholds. Bar plot of $R_{Q}$ at $t^{\mathrm{ERM}}$ and $t^{J}$ summarizes the improvement.
  • Figure 2: Framework validation under distribution shift (illustrative case). (a) Marginal score distributions for two markers in $P$ (blue) and $Q$ (purple) illustrate site shift. (b) Internal vs. external risk: (left) empirical $R_P(t)$ with an upper bound; (right) a $P$-frozen bound contrasted with $R_Q(t)$. (c) External risk decomposition at the selected threshold: internal empirical risk $\widehat{R}_P$, estimated prevalence/shape shifts, and the stability penalty $\mathcal{G}_{\text{boot}}$ track external risk; error bars are bootstrap s.e.; see Appendix for construction details. (d) ERM vs. penalized threshold across replicates: most points lie below $y{=}x$, indicating lower external risk with the penalty on $Q$ after freezing on $P$.
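
The display construction described in Figure 1(b) and the penalized objective in Figure 1(c) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code. It assumes one already-aggregated score per patient, a 0-1 loss, a curvature proxy normalized by its maximum absolute value, and hypothetical values for the window, $\lambda$, and the synthetic data. The exact normalization is deferred to the paper's Appendix.

```python
import numpy as np

def patient_bootstrap_risk_curves(scores, labels, thresholds, n_boot=200, seed=0):
    """Empirical 0-1 risk at every threshold for each patient-level bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    # errors[i, j] is True when patient i is misclassified at thresholds[j].
    errors = (scores[:, None] >= thresholds[None, :]) != (labels[:, None] == 1)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample whole patients with replacement
    return errors[idx].mean(axis=1)             # shape (n_boot, n_thresholds)

def instability_map(curves, window=5):
    """Pointwise bootstrap s.d. times a curvature proxy, smoothed and scaled to [0, 1]."""
    sd = curves.std(axis=0)
    mean_curve = curves.mean(axis=0)
    # Second finite difference of the bootstrap mean curve (zero at the endpoints).
    d2 = np.zeros_like(mean_curve)
    d2[1:-1] = np.abs(mean_curve[2:] - 2.0 * mean_curve[1:-1] + mean_curve[:-2])
    kappa = d2 / d2.max() if d2.max() > 0 else d2  # assumed normalization
    raw = sd * kappa
    smooth = np.convolve(raw, np.ones(window) / window, mode="same")  # moving average
    span = smooth.max() - smooth.min()
    return (smooth - smooth.min()) / span if span > 0 else np.zeros_like(smooth)

def select_thresholds(val_risk, g_boot, thresholds, lam=0.5):
    """Return (t_ERM, t_J): minimizers of the plain and the penalized objective."""
    t_erm = thresholds[np.argmin(val_risk)]
    t_pen = thresholds[np.argmin(val_risk + lam * g_boot)]
    return t_erm, t_pen

# Synthetic patient-level data (assumption, for illustration only).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=300)
scores = rng.normal(loc=1.5 * labels, scale=1.0)
thresholds = np.linspace(-2.0, 3.5, 56)

curves = patient_bootstrap_risk_curves(scores, labels, thresholds)
g_boot = instability_map(curves)
val_risk = ((scores[:, None] >= thresholds[None, :]) != (labels[:, None] == 1)).mean(axis=0)
t_erm, t_pen = select_thresholds(val_risk, g_boot, thresholds, lam=0.5)
```

Resampling patients rather than instances keeps each patient's data together in every replicate, which is what makes the bootstrap respect the hierarchical dependence. Setting `lam=0` recovers the ERM threshold.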

Theorems & Definitions (8)

  • Definition 4.1: Patient-level generalization term
  • Definition 4.2: Internal risk modulus
  • Remark 4.3: Oscillation form of the internal modulus
  • Remark 4.4: Conservative upper band for $\omega_P$
  • Definition 4.5: Operating-point shift: signed and magnitude gaps
  • Remark 4.6: Bounding the shift by global distances
  • Theorem 4.7: External risk: base and augmented
  • Proposition 4.9: Bootstrap upper envelope for instability