Domain constraints improve risk prediction when outcome data is missing

Sidhika Balachandar, Nikhil Garg, Emma Pierson

TL;DR

The paper tackles risk prediction in selective-labels settings, where outcomes are observed only for historically tested individuals, which creates a distribution shift between the tested and untested groups. It introduces a Bayesian model in which each individual has a risk score $r_i = X_i^T\beta_Y + Z_i$, with $Z_i$ an unobserved component, and the testing decision depends on $\alpha r_i$ plus a feature-based adjustment $X_i^T\beta_\Delta$. The model incorporates two domain constraints: a prevalence constraint, under which $\mathbb{E}[Y]$ is known, and an expertise constraint, under which only a restricted set of features can shift testing beyond what risk explains. The authors prove that these constraints do not worsen, and can strictly improve, posterior precision. The theory rests on a connection to the Heckman model and on variance-reduction results, and it is supported empirically by synthetic experiments and a breast cancer case study. In the case study on UK Biobank data, the model's inferred risks align with cancer diagnoses, the inferred unobservables correlate with known unobservables such as family history, and the inferred testing policies reflect public-health norms, with the prevalence constraint yielding more plausible inferences. Overall, the work shows how domain constraints mitigate bias and variance in selective-labels settings and suggests applicability beyond healthcare.
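
For reference, the model summarized above can be written out compactly. This is a sketch rather than the paper's exact statement: the Bernoulli-sigmoid links are assumed from the model name used in the Figure 2 caption, and the index $d$ over features is notation introduced here for illustration. The outcome $Y_i$ is observed only when $T_i = 1$.

```latex
\begin{align*}
r_i &= X_i^T \beta_Y + Z_i, \\
Y_i &\sim \mathrm{Bernoulli}\big(\sigma(r_i)\big), \\
T_i &\sim \mathrm{Bernoulli}\big(\sigma(\alpha r_i + X_i^T \beta_\Delta)\big), \\
\text{prevalence constraint:}\quad & \mathbb{E}[Y] \text{ is fixed to its known value}, \\
\text{expertise constraint:}\quad & \beta_{\Delta, d} = 0 \text{ for every feature } d \text{ outside the permitted set}.
\end{align*}
```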

Abstract

Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making determines whether the outcome is observed: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
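
The selective-labels problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a single observed feature, sigmoid links for both the outcome and the testing decision, and uniform noise for the unobservable. The function names `simulate` and `prevalences` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic link, assumed here for both outcome and testing decision."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n, alpha, beta_y, beta_delta, seed=0):
    """Draw (x, tested, y) triples from a one-feature toy model.

    The true y is kept for every row so the bias can be measured;
    in real data y would be hidden whenever tested is False.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # observed feature
        z = rng.uniform(-1.0, 1.0)     # unobservable (assumed uniform noise)
        r = beta_y * x + z             # risk score
        y = rng.random() < sigmoid(r)  # disease outcome
        # Testing depends on risk (scaled by alpha) plus a feature adjustment.
        t = rng.random() < sigmoid(alpha * r + beta_delta * x)
        rows.append((x, t, y))
    return rows


def prevalences(rows):
    """Return (overall, tested, untested) outcome rates using the true y."""
    def rate(ys):
        return sum(ys) / len(ys)

    all_y = [y for _, _, y in rows]
    tested_y = [y for _, t, y in rows if t]
    untested_y = [y for _, t, y in rows if not t]
    return rate(all_y), rate(tested_y), rate(untested_y)


rows = simulate(20000, alpha=2.0, beta_y=1.0, beta_delta=0.0)
overall, tested, untested = prevalences(rows)
print(f"overall={overall:.3f} tested={tested:.3f} untested={untested:.3f}")
```

When `alpha` is positive, higher-risk individuals are tested more often, so the outcome rate among the tested should overstate the overall rate. That gap is what a known overall prevalence helps to pin down.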

Paper Structure (50 sections, 6 theorems, 24 equations, 13 figures)

Key Result

Proposition 3.0

The Heckman model (Definition 1) is equivalent to a special case of the general model in equation eq:DGP.

Figures (13)

  • Figure 1: Effect of $\alpha$ and $X\boldsymbol{\beta_\Delta}$: $\alpha$ controls how steeply testing probability $p(T_i)$ increases in disease risk $p(Y_i)$, while $X\boldsymbol{\beta_\Delta}$ captures factors which affect $p(T_i)$ when controlling for $p(Y_i)$.
  • Figure 2: The prevalence and expertise constraints each produce more precise and accurate inferences on synthetic data drawn from the Bernoulli-sigmoid model with uniform noise (equation \ref{eq:uniform_model}). To quantify precision (left), we report the percent reduction in 95% confidence interval width as compared to the unconstrained model. To quantify accuracy (right), we report the percent reduction in posterior mean error (i.e., the absolute difference between the posterior mean and the true parameter value) as compared to the unconstrained model. We plot the median across 200 synthetic datasets. Error bars denote the bootstrapped 95% confidence interval on the median.
  • Figure 3: Estimated $\boldsymbol{\beta_Y}$ (top) capture known cancer risk factors: genetic risk, previous biopsy, age at first period (menarche), and age [nih_risk_tool, yanes2020clinical]. Estimated $\boldsymbol{\beta_\Delta}$ (bottom) capture the underuse of genetic information (left) and known age-based testing policies (right). Points indicate posterior means and vertical lines indicate 95% confidence intervals. Gray asterisks indicate coefficients set to 0 by the expertise constraint.
  • Figure 4: Without the prevalence constraint, the model learns that cancer risk first increases and then decreases with age (left orange), contradicting prior literature [cancer_risk_predictors2, cr_prevalence_stats, us2013us, campisi2013aging]. This incorrect inference occurs because the tested population has the same misleading age trend (right). In contrast, the prevalence constraint encodes that the (younger) untested population has lower risk, allowing the model to learn a more accurate age trend (left blue).
  • Figure S1: Results using synthetic data from the Heckman model. The prevalence and expertise constraints each produce more precise and accurate inferences on this synthetic data. We plot the median across 200 synthetic datasets. Error bars denote the bootstrapped 95% confidence interval on the median.
  • ...and 8 more figures
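
The evaluation metrics named in the Figure 2 caption (percent reduction in interval width, percent reduction in posterior mean error, and a bootstrapped interval on the median across datasets) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: it assumes the 95% interval is the central interval of posterior samples, and every function name and default here is hypothetical.

```python
import random
import statistics


def percent_reduction(constrained, unconstrained):
    """Percent by which the constrained value is smaller than the unconstrained one."""
    return 100.0 * (unconstrained - constrained) / unconstrained


def interval_width(samples):
    """Width of the central 95% interval of posterior samples (assumed definition)."""
    cuts = statistics.quantiles(samples, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[-1] - cuts[0]


def posterior_mean_error(samples, true_value):
    """Absolute difference between the posterior mean and the true parameter."""
    return abs(statistics.fmean(samples) - true_value)


def bootstrap_median_ci(values, n_boot=2000, seed=0):
    """Median of values with a percentile-bootstrap 95% interval on that median."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]
    cuts = statistics.quantiles(medians, n=40)
    return statistics.median(values), cuts[0], cuts[-1]


# Illustrative use with made-up posterior samples (not results from the paper).
rng = random.Random(1)
unconstrained = [rng.gauss(0.0, 2.0) for _ in range(4000)]
constrained = [rng.gauss(0.0, 1.0) for _ in range(4000)]
width_gain = percent_reduction(interval_width(constrained), interval_width(unconstrained))
print(f"percent reduction in interval width: {width_gain:.1f}")
```

In a full experiment, one such percent reduction would be computed per synthetic dataset and the resulting list passed to `bootstrap_median_ci` to obtain the plotted median and its error bar.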

Theorems & Definitions (13)

  • Definition 1: Heckman correction model
  • Proposition 3.0
  • Definition 2: Expected conditional variance
  • Proposition 3.0
  • Definition 2: Heckman correction model
  • Proposition B.0
  • proof
  • Definition 2: Expected conditional variance
  • Proposition B.0
  • proof
  • ...and 3 more