Conformalized Survival Distributions: A Generic Post-Process to Increase Calibration

Shi-ang Qi, Yakun Yu, Russell Greiner

TL;DR

This paper addresses a persistent problem in survival analysis: improving calibration typically comes at the cost of discriminative accuracy. It introduces Conformalized Survival Distribution (CSD), a model-agnostic post-processing framework that applies conformal regression to discretized percentile times to recalibrate predicted survival distributions while preserving discrimination. The authors prove theoretical guarantees, including distribution calibration and KM calibration, and validate the method on 11 real-world datasets, showing robust calibration gains with minimal or no loss in C-index. They also compare CSD to objective-based calibration approaches, demonstrate the benefits of KM-sampling for handling censoring, and provide practical guidance and code for practitioners. The work offers a scalable, theoretically grounded approach to producing reliable, calibrated survival distributions in the presence of censoring, with broad applicability in clinical decision making and resource allocation.

Abstract

Discrimination and calibration represent two important properties of survival analysis, with the former assessing the model's ability to accurately rank subjects and the latter evaluating the alignment of predicted outcomes with actual events. Because the two properties are distinct, it is hard for survival models to optimize both simultaneously, especially as many previous results found that improving calibration tends to diminish discrimination performance. This paper introduces a novel approach utilizing conformal regression that can improve a model's calibration without degrading discrimination. We provide theoretical guarantees for the above claim, and rigorously validate the efficiency of our approach across 11 real-world datasets, showcasing its practical applicability and robustness in diverse scenarios.

Paper Structure (68 sections, 10 theorems, 54 equations, 19 figures, 3 tables, 1 algorithm)

Key Result

Theorem 3.1

Applying the CSD adjustment to the percentile predictions does not affect the C-index of the model, regardless of whether the negative of the median or the negative of the mean survival time is used as the predicted risk score.
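
The intuition behind this invariance can be illustrated with a minimal sketch. It assumes that the adjustment adds one shared offset per percentile level, taken here as the empirical quantile of calibration residuals (observed time minus predicted percentile time). The function names `csd_adjust` and `c_index`, the synthetic data, and the omission of censoring and of any finite-sample correction are assumptions made for illustration, not the authors' implementation. Only the median-based risk score is checked.

```python
import numpy as np

def c_index(risk, time):
    """Harrell's C-index for fully uncensored data (risk ties count as 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j]:          # subject i fails first
                comparable += 1
                if risk[i] > risk[j]:      # and was predicted to be riskier
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def csd_adjust(pct_cal, t_cal, pct_test, levels):
    """Hypothetical adjustment: shift each percentile column by one shared offset.

    The offset for level rho is the rho-quantile of the calibration residuals
    t_cal - pct_cal[:, k]; every test subject receives the same shift.
    """
    adjusted = np.empty_like(pct_test)
    for k, rho in enumerate(levels):
        scores = t_cal - pct_cal[:, k]
        adjusted[:, k] = pct_test[:, k] + np.quantile(scores, rho)
    return adjusted

rng = np.random.default_rng(0)
levels = np.array([0.25, 0.50, 0.75])
spread = np.array([0.6, 1.0, 1.5])         # synthetic, deliberately miscalibrated

def simulate(n):
    base = rng.uniform(5.0, 15.0, size=n)              # per-subject scale
    times = base * rng.lognormal(0.0, 0.3, size=n)     # observed event times
    return base[:, None] * spread, times               # predicted percentile times

pct_cal, t_cal = simulate(200)
pct_test, t_test = simulate(50)
pct_adj = csd_adjust(pct_cal, t_cal, pct_test, levels)

median_col = 1                                         # the 50% level
c_before = c_index(-pct_test[:, median_col], t_test)
c_after = c_index(-pct_adj[:, median_col], t_test)
print(c_before, c_after)
```

Because every subject's median moves by the same amount, the ordering of the risk scores is unchanged, so the two printed values coincide under these assumptions.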

Figures (19)

  • Figure 1: Comparison of distribution calibration (upper part) and single-time calibration (lower part) using four uncensored patients. The leftmost panel (a) shows the predicted survival distributions (solid curves) and the true labels (stars) for the four patients. The upper section shows the D-cal process: (b) obtains the predicted probability at each patient's event time; (c) constructs a histogram of these probabilities, where an ideally calibrated model would yield a uniform histogram; and (d) computes a probability–probability (P-P) plot assessing the histogram's uniformity. The lower section shows the 1-cal process: (e) obtains the predicted probabilities at a target time $t^*$ and groups the patients based on these sorted probabilities; (f) calculates the average predicted and observed probabilities for each group, which should be statistically similar; and (g) computes a P-P plot visualizing the similarity between predicted and observed probabilities.
  • Figure 2: Example of using Conformalized Survival Distribution (CSD) to make the predictions D-calibrated, using the same patients and predictions shown in Figure 1. (a) Discretize the predicted survival distributions at three percentile levels (25%, 50%, 75%); (b) Generate the new individual survival distributions (ISDs) by adjusting the percentile times (PCTs), where the hollow points are the old PCTs; (c) Calculate the D-cal histogram using the adjusted ISDs; (d) P-P plot comparing the D-cal level between non-CSD and CSD predictions.
  • Figure 3: An example of KM sampling from the METABRIC dataset.
  • Figure 4: A subset of the empirical results. Error bars show the mean and 95% CI over ten runs; blue denotes the non-CSD baseline and orange the CSD version. The red dashed lines mark the mean calibration performance of the KM estimator, which serves as an empirical lower limit. A higher C-index indicates better performance, whereas lower scores are preferable for the other metrics.
  • Figure 5: Comparison of CSD with objective-based methods on a deep log-normal model. The baseline (blue) uses the likelihood loss (LL). For the X-cal and SFM methods, the weight of the calibration loss is gradually increased. The red dashed lines serve as the empirical lower limit.
  • ...and 14 more figures
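
The D-cal steps described in the Figure 1 caption (predicted probability at each event time, histogram, P-P curve) can be sketched for uncensored subjects as follows. This is an illustrative sketch, not the paper's code: the helper names `d_cal_histogram` and `d_cal_pp`, the exponential toy model, and the choice of ten bins are assumptions, and censoring is not handled.

```python
import numpy as np

def d_cal_histogram(surv_prob_at_event, n_bins=10):
    """Fraction of subjects whose predicted S(t_i | x_i) lands in each bin of [0, 1]."""
    counts, _ = np.histogram(surv_prob_at_event, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()

def d_cal_pp(surv_prob_at_event, grid):
    """P-P curve: observed fraction of subjects with S(t_i | x_i) <= p for each p."""
    probs = np.asarray(surv_prob_at_event)
    return np.array([(probs <= p).mean() for p in grid])

rng = np.random.default_rng(0)
rate = 0.2
event_times = rng.exponential(scale=1.0 / rate, size=5000)

# Step (b): predicted survival probability at each subject's own event time.
probs_good = np.exp(-rate * event_times)        # model uses the true rate
probs_bad = np.exp(-2.0 * rate * event_times)   # model overestimates the hazard

# Step (c): histogram; a D-calibrated model gives roughly equal bin heights.
hist_good = d_cal_histogram(probs_good)
hist_bad = d_cal_histogram(probs_bad)
dev_good = np.abs(hist_good - 0.1).max()
dev_bad = np.abs(hist_bad - 0.1).max()

# Step (d): P-P curve; a D-calibrated model stays close to the diagonal.
grid = np.linspace(0.1, 1.0, 10)
pp_good = d_cal_pp(probs_good, grid)
print(dev_good, dev_bad)
```

In this toy setup the correctly specified model yields a nearly flat histogram, while the model with a doubled hazard piles probability mass into the lowest bins.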

Theorems & Definitions (20)

  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Remark 3.4
  • Remark 3.5
  • Remark 3.6
  • Proposition 1.1
  • proof
  • Theorem 2.1
  • proof
  • ...and 10 more