Inference for relative sparsity

Samuel J. Weisenthal; Sally W. Thurston; Ashkan Ertefaie

TL;DR

This work develops inference for relative sparsity-constrained, multi-stage policies in healthcare by nesting a weighted Trust Region Policy Optimization (TRPO) objective inside a relative sparsity penalty and combining an adaptive Lasso-style penalty with sample splitting. The KL-constrained base objective keeps the estimand finite and amenable to inference, even when the optimal policy is deterministic and would otherwise yield unbounded parameters. The authors develop consistent, asymptotically normal estimators for the policy coefficients and the value, provide post-selection inference procedures, and validate them through simulations and an analysis of real MIMIC-III vasopressor data, yielding a sparse, interpretable policy with valid uncertainty quantification. The framework supports safer translation of data-driven decisions into clinical practice by quantifying uncertainty and by giving a transparent, sparse account of how the new policy diverges from the standard of care.

Abstract

In healthcare, there is much interest in estimating policies, or mappings from covariates to treatment decisions. Recently, there is also interest in constraining these estimated policies to the standard of care, which generated the observed data. A relative sparsity penalty was proposed to derive policies that have sparse, explainable differences from the standard of care, facilitating justification of the new policy. However, the developers of this penalty only considered estimation, not inference. Here, we develop inference for the relative sparsity objective function, because characterizing uncertainty is crucial to applications in medicine. Further, in the relative sparsity work, the authors only considered the single-stage decision case; here, we consider the more general, multi-stage case. Inference is difficult, because the relative sparsity objective depends on the unpenalized value function, which is unstable and has infinite estimands in the binary action case. Further, one must deal with a non-differentiable penalty. To tackle these issues, we nest a weighted Trust Region Policy Optimization function within a relative sparsity objective, implement an adaptive relative sparsity penalty, and propose a sample-splitting framework for post-selection inference. We study the asymptotic behavior of our proposed approaches, perform extensive simulations, and analyze a real, electronic health record dataset.
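
The two ingredients named at the end of the abstract, an adaptive relative sparsity penalty and sample splitting, can be illustrated with a minimal sketch. This is not the authors' implementation: the weight form (inverse absolute difference between an initial fit and the behavioral coefficients, raised to a power `delta`) is an assumption borrowed from the standard adaptive Lasso, and every function and argument name below is hypothetical.

```python
import numpy as np

def adaptive_relative_penalty(beta, b, beta_init, lam, delta, eps=1e-12):
    """Weighted L1 penalty on the difference between suggested (beta) and
    behavioral (b) coefficients. ASSUMPTION: adaptive-Lasso-style weights
    1 / |beta_init - b|^delta, where beta_init is an initial unpenalized fit."""
    beta, b, beta_init = (np.asarray(v, dtype=float) for v in (beta, b, beta_init))
    weights = 1.0 / np.maximum(np.abs(beta_init - b), eps) ** delta
    return lam * np.sum(weights * np.abs(beta - b))

def split_indices(n, seed=0):
    """Randomly split patient indices 0..n-1 into two disjoint halves:
    one for selection, one for post-selection inference."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    half = n // 2
    return np.sort(perm[:half]), np.sort(perm[half:])

def selected_coordinates(beta_hat, b_hat, tol=1e-8):
    """Indices of coefficients that differ from the behavioral policy,
    i.e. the covariates 'selected' on the first split."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(b_hat, dtype=float)
    return np.flatnonzero(np.abs(diff) > tol)
```

Under this sketch, selection would run on the first half of the indices, and only the selected coordinates would be re-estimated on the second half, so that the inference step does not reuse the data that chose the model.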

Paper Structure (59 sections, 6 theorems, 139 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Notation
Background
Markov decision processes (MDPs)
On inference with Trust Region Policy Optimization (TRPO)
Methodological Contributions
Adding (adaptive) relative sparsity to Trust Region Policy Optimization (TRPO)
Sample splitting in the relative sparsity framework
Estimation
Value
Trust Region Policy Optimization (TRPO)
Adaptive relative sparsity
Tuning parameters
Theory
Assumptions
...and 44 more sections

Key Result

Lemma 1

We have that ${M}_n$ is consistent for $M_0,$ or that ${M}_n \overset{p}{\to} M_0.$
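
For reference, the convergence in probability asserted by the lemma can be written out as follows. The norm $\lVert\cdot\rVert$ is an assumption here, since this excerpt does not say whether $M_n$ is a scalar, vector, or matrix; for a scalar it reduces to the absolute value.

$$
M_n \overset{p}{\to} M_0
\quad\Longleftrightarrow\quad
\lim_{n\to\infty} P\bigl(\lVert M_n - M_0\rVert > \varepsilon\bigr) = 0
\quad \text{for every } \varepsilon > 0 .
$$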

Figures (3)

Figure 1: Selection diagrams for the simulated data ($n_{train}=250$, $n_{test}=250$). Over increasing $\gamma$ (going down) and $\delta$ (going right), we show the average coefficients in the suggested ($\beta_{n,\gamma,\lambda}$) and behavioral ($b_n$) policies, the average difference in probability of treatment between the two policies, and the average value ($V_n$) for the suggested policy, all of which were computed in the first split of the data. The dotted vertical line indicates $\lambda_n$, a choice of $\lambda$ based on (\ref{lambda.crit.diff}). Note that the average suggested policy probability of treatment is $\pi_{sugg}=(1/nT)\sum_i\sum_t{\pi}_{\beta_{n,\gamma,\lambda}}(A_{i,t}=1|s_{i,t})$, and analogously for $\pi_{beh}$. The shaded regions in the coefficient ($\beta_{n,\gamma,\lambda}$) panels correspond to (\ref{eq:sigma2n}), and the dotted lines show one standard error estimated empirically. The shaded regions in the value panels show one standard error based on (\ref{valuevar}), which was used to select $\lambda_n$ using (\ref{lambda.crit.diff}), and the dotted lines show one standard error estimated empirically.
Figure 2: Selection diagrams for the real data ($n_{train}=1,176$, $n_{test}=1,176$). Over increasing $\gamma$ (going down) and $\delta$ (going right), we show the average coefficients in the suggested ($\beta_{n,\gamma,\lambda}$) and behavioral ($b_n$) policies, the average difference in probability of treatment between the two policies, and the average value ($V_n$) for the suggested policy, all of which were computed in the first split of the data. The dotted vertical line indicates $\lambda_n$, a choice of $\lambda$ based on (\ref{lambda.crit.diff}). Note that the average suggested policy probability of treatment is $\pi_{sugg}=(1/nT)\sum_i\sum_t{\pi}_{\beta_{n,\gamma,\lambda}}(A_{i,t}=1|s_{i,t})$, and analogously for $\pi_{beh}$. The shaded regions in the coefficient ($\beta_{n,\gamma,\lambda}$) panels correspond to (\ref{eq:sigma2n}); to declutter the plot, and because MAP was the only selected covariate, we show this only for MAP. The shaded regions in the value panels show one standard error based on (\ref{valuevar}), which was used to select $\lambda_n$ based on (\ref{lambda.crit.diff}).
Figure A.1: Real data calibration curve for the behavioral policy, comparing estimated and observed probabilities. We repeatedly resample the real data, each time training on one half and generating a calibration curve on the held-out half with the R package "predtools", and then average these test-set calibration curves. The behavioral policy is stationary by Assumption \ref{assum:stationarity}, so we treat action-state observations at different time steps as independent. The confidence intervals might therefore be too narrow, but our main focus is that the point estimates lie close to the identity line.
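
The average treatment probability used in the captions of Figures 1 and 2, $(1/nT)\sum_i\sum_t \pi_\beta(A_{i,t}=1|s_{i,t})$, can be computed as in the sketch below. It assumes a logistic policy, $\pi_\beta(A=1|s)=\mathrm{expit}(\beta^\top s)$, which this excerpt does not state explicitly; the function name and the array layout (patients by time steps by covariates) are hypothetical.

```python
import numpy as np

def expit(x):
    """Logistic function, mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

def average_treatment_probability(states, beta):
    """Return (1/nT) * sum_i sum_t pi_beta(A=1 | s_{i,t}).

    states: array of shape (n, T, K), covariates for n patients over T steps.
    beta:   array of shape (K,), policy coefficients.
    ASSUMPTION: the policy is logistic in the state."""
    states = np.asarray(states, dtype=float)
    beta = np.asarray(beta, dtype=float)
    probs = expit(states @ beta)  # shape (n, T)
    return probs.mean()
```

Evaluating the same function at the suggested coefficients and at the behavioral coefficients, then subtracting, would give the average difference in treatment probability that the selection diagrams plot.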

Theorems & Definitions (18)

Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
Lemma 1
proof
Lemma 2
proof
Lemma 3
...and 8 more
