Bridging Binarization: Causal Inference with Dichotomized Continuous Exposures

Kaitlyn J. Lee, Alan Hubbard, Alejandro Schuler

TL;DR

This paper examines the common practice of binarizing a continuous exposure in order to estimate a causal effect, and clarifies when the resulting binarized ATE (BATE) is a valid causal contrast. It shows that the BATE equals the difference in expected outcomes under two modified treatment policies that impose a cutoff and preserve relative self-selection: $\Psi_{BATE}=E[Y_{\tilde{A}_1}] - E[Y_{\tilde{A}_0}]$, where, for example, $\tilde{A}_1$ has conditional density $p_{\tilde{A}_1|W}(a,w)=p_{A|W}(a,w)/\pi_{\mathcal{A}}(w)$ for exposure values $a$ in the set $\mathcal{A}$ allowed by the cutoff, with $\pi_{\mathcal{A}}(w)$ renormalizing the observed density over that set. The authors also introduce the causal attributable effect of binarization (CAB), defined as $\Psi_{CAB}=E[Y_{\tilde{A}_t}] - E[Y]$ for $t\in\{0,1\}$, which compares a post-binarization policy to the observed world and often relies on weaker identification assumptions. Both parameters can be estimated by regression, IPW, AIPW, or TMLE; crucially, only the distribution of the binarized exposure $T$ is needed, not the full density of the original continuous exposure. Through simulations and an applied study of California birth outcomes near oil and gas wells, the paper demonstrates that the BATE can overstate policy effects, whereas the CAB provides a more policy-relevant benchmark.
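
As a rough illustration of the claim that only the binarized exposure is needed for estimation, the following sketch computes plug-in and IPW estimates of the BATE and CAB on simulated data. Everything about the data-generating process here is an assumption made for illustration (a binary covariate `w`, a normal exposure `a`, a linear outcome `y`, a cutoff at 6, and `t = 1` meaning the exposure lies above the cutoff); it is not the paper's simulation design, and the helper names `stratified_means` and `ipw_mean` are hypothetical.

```python
import numpy as np


def stratified_means(y, t, w):
    """Plug-in estimate of E[ E[Y | T = arm, W] ] for arm in {0, 1}, discrete W."""
    psi = {0: 0.0, 1: 0.0}
    for level in np.unique(w):
        in_w = w == level
        p_w = in_w.mean()  # empirical P(W = level)
        for arm in (0, 1):
            # Average outcome among units in this stratum with T = arm.
            psi[arm] += p_w * y[in_w & (t == arm)].mean()
    return psi


def ipw_mean(y, t, w, arm):
    """IPW estimate of E[ E[Y | T = arm, W] ] using only P(T = arm | W)."""
    pi = np.empty(len(y))
    for level in np.unique(w):
        in_w = w == level
        pi[in_w] = (t[in_w] == arm).mean()  # estimated P(T = arm | W = level)
    return float(np.mean((t == arm) * y / pi))


# Assumed toy data-generating process (not taken from the paper).
rng = np.random.default_rng(0)
n = 20_000
w = rng.integers(0, 2, size=n)           # binary covariate W
a = rng.normal(5.0 + 2.0 * w, 1.5)       # continuous exposure A
y = 2.0 * a + w + rng.normal(size=n)     # outcome Y
t = (a > 6.0).astype(int)                # binarized exposure T

psi = stratified_means(y, t, w)
bate = psi[1] - psi[0]        # binarized ATE
cab_1 = psi[1] - y.mean()     # CAB for t = 1: policy versus observed world
cab_0 = psi[0] - y.mean()     # CAB for t = 0
ipw_psi1 = ipw_mean(y, t, w, arm=1)

print(f"BATE={bate:.3f}  CAB_1={cab_1:.3f}  CAB_0={cab_0:.3f}  IPW psi_1={ipw_psi1:.3f}")
```

Note that the continuous exposure `a` is used only to construct `t`; neither estimator touches its density. In this toy setup the CAB for $t=1$ is smaller than the BATE because part of the population already satisfies the cutoff, and the two CABs differ by exactly the BATE.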

Abstract

The average treatment effect (ATE) is a common parameter estimated in the causal inference literature, but it is only defined for binary exposures. Thus, despite concerns raised by some researchers, many studies seeking to estimate the causal effect of a continuous exposure create a new binary exposure variable by dichotomizing the continuous values into two categories. In this paper, we affirm binarization as a statistically valid method for answering causal questions about continuous exposures by showing the equivalence between the binarized ATE and the difference in the average outcomes of two specific modified treatment policies. These policies impose cut-offs corresponding to the binarized exposure variable and assume preservation of relative self-selection. Relative self-selection is the ratio of the probability density of an individual having an exposure equal to one value of the continuous exposure variable versus another. The policies assume that, for any two values of the exposure variable with non-zero probability density after the cut-off, this ratio will remain unchanged. Through this equivalence, we clarify the assumptions underlying binarization and discuss how to properly interpret the resulting estimator. Additionally, we introduce a new target parameter that can be computed after binarization and that uses the observed world as a benchmark. We argue that this parameter addresses more relevant causal questions than the traditional binarized ATE parameter. We present a simulation study to illustrate the implications of these assumptions when analyzing data and to demonstrate how to correctly implement estimators of the parameters discussed. Finally, to further illustrate the underlying assumptions, we apply this method to evaluate the effect on birth outcomes of a California law that seeks to limit exposure to oil and gas wells.
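
The preservation of relative self-selection can be checked in one line. The identity below is a sketch under our reading of the notation: $\mathcal{A}$ is the set of exposure values allowed by the cut-off, $\pi_{\mathcal{A}}(w)=P(A\in\mathcal{A}\mid W=w)$, the policy density is the observed density renormalized on $\mathcal{A}$, and $T=1$ indicates $A\in\mathcal{A}$. These definitions are assumptions inferred from the summary, not quoted from the paper. For any $a, a' \in \mathcal{A}$ with non-zero density,

$$
\frac{p_{\tilde{A}_1|W}(a,w)}{p_{\tilde{A}_1|W}(a',w)}
= \frac{p_{A|W}(a,w)/\pi_{\mathcal{A}}(w)}{p_{A|W}(a',w)/\pi_{\mathcal{A}}(w)}
= \frac{p_{A|W}(a,w)}{p_{A|W}(a',w)},
$$

so the normalizing constant cancels and the ratio is the same as in the observed world. The same renormalization explains why the binarized exposure suffices: averaging the conditional mean outcome over the truncated density gives

$$
\int_{\mathcal{A}} E[Y \mid A=a, W=w]\,\frac{p_{A|W}(a,w)}{\pi_{\mathcal{A}}(w)}\,da
= E[Y \mid A\in\mathcal{A}, W=w]
= E[Y \mid T=1, W=w],
$$

which, under the usual identification assumptions (taken as given here), is the inner term of the binarized ATE for the $T=1$ arm.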

Paper Structure (20 sections, 21 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Densities of exposures $A$, $\tilde{A}_1$, and $\tilde{A}_0$ by different values of $W$. The graph on the left shows the densities when $W=0$ and the graph on the right shows the densities when $W=1$. The green line represents the cut-off value of $A=6$. The gray line is the density of the observed $A$. The red dotted line is the density of $\tilde{A}_1$. The blue dotted line is the density of $\tilde{A}_0$.
  • Figure 2: Histograms of the distance to the nearest active well for households with pregnant Hispanic people aged 25–29, with less than a high school education, an adequate Kotelchuck index, and nulliparous, who gave birth to male babies in September 2010. The histogram on the left is the observed distribution. The histogram on the right is the observed distribution after imposing a cut-off at 1 km; this is approximately what we assume the distribution of distance to the nearest active well would look like after the Health Protection Zones are implemented.
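
The kind of densities described in the Figure 1 caption can be sketched numerically. The code below assumes a normal conditional density for $A$ given $W$ with made-up means and standard deviation, takes the cut-off of 6 from the caption, and assumes that $\tilde{A}_1$ lives above the cut-off and $\tilde{A}_0$ at or below it; the paper's actual simulation distributions and the direction of the cut-off may differ. The helper names are hypothetical.

```python
import math

import numpy as np


def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) evaluated on an array x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """CDF of Normal(mean, sd) at a scalar x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def trapezoid(f, x):
    """Trapezoid-rule integral of samples f over grid x."""
    return float(np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(x)))


cutoff = 6.0                               # cut-off value from the Figure 1 caption
grid = np.linspace(-5.0, 17.0, 20001)      # exposure values a
results = {}
for w_value, mean in {0: 5.0, 1: 7.0}.items():   # assumed means of A given W
    dens = normal_pdf(grid, mean, 1.5)            # observed density p_{A|W}
    pi = 1.0 - normal_cdf(cutoff, mean, 1.5)      # P(A > cutoff | W = w_value)
    # Renormalize the observed density on each side of the cut-off.
    p_a1 = np.where(grid > cutoff, dens / pi, 0.0)
    p_a0 = np.where(grid <= cutoff, dens / (1.0 - pi), 0.0)
    results[w_value] = (dens, p_a1, p_a0, pi)
    print(f"W={w_value}: pi={pi:.3f}, "
          f"integral of p_A1={trapezoid(p_a1, grid):.3f}, "
          f"integral of p_A0={trapezoid(p_a0, grid):.3f}")
```

Each truncated density integrates to approximately one, and within the allowed region its shape is the observed density scaled by a constant, which is what preserving relative self-selection means in this sketch.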