Estimating treatment effects from single-arm trials via latent-variable modeling

Manuel Haussmann, Tran Minh Son Le, Viivi Halla-aho, Samu Kurki, Jussi V. Leinonen, Miika Koskinen, Samuel Kaski, Harri Lähdesmäki

TL;DR

This work tackles estimating treatment effects when only a single-arm trial is available, by leveraging external controls from real-world data. It introduces an identifiable latent-variable model that learns group-specific and shared latent representations, enabling both direct treatment effect estimation and patient matching without leaking post-treatment information. The model accounts for structured missingness via MNAR modeling and uses amortized variational inference to learn the predictive latent space, with identifiability guaranteed through a conditional prior and auxiliary variables. Empirical results on semi-synthetic IHDP data and a real-world dataset combining an RCT with EHRs show consistent improvements over a broad set of baselines across CATE, ATT, and survival tasks. The approach provides a practical, theoretically grounded framework for using external controls in settings where randomized trials are infeasible.
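
For reference, the two effect estimands named above can be written in the potential-outcome notation $Y(0)$, $Y(1)$ used in the figure captions. These are the standard textbook definitions, not formulas copied from the paper; the symbols $X$ (covariates) and $T$ (treatment indicator) are notation assumed here for illustration.

$$
\begin{aligned}
\mathrm{CATE}(x) &= \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right], \\
\mathrm{ATT} &= \mathbb{E}\left[\,Y(1) - Y(0) \mid T = 1\,\right].
\end{aligned}
$$

In a single-arm trial, $Y(1)$ is observed only for trial participants and $Y(0)$ only for external controls, which is why an external control group is needed to estimate either quantity.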

Abstract

Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation, but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for (i) patient matching if treatment outcomes are not available for the treatment group, or for (ii) direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation and for effect estimation via patient matching.
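
To make use case (i) concrete, the following is a minimal sketch of nearest-neighbor patient matching in a shared latent space. It is not the authors' implementation: the latent vectors are simulated with random numbers rather than produced by the model, the function names `match_controls` and `att_from_matches` are hypothetical, and plain Euclidean matching with replacement is an assumed choice. The toy outcomes include a constant effect of 2.0 so the estimate can be sanity-checked.

```python
import numpy as np


def match_controls(z_treated, z_control):
    """Return, for each treated unit, the index of its nearest control unit.

    Matching is by Euclidean distance in latent space, with replacement
    (an assumption made for this sketch).
    """
    # Pairwise squared distances, shape (n_treated, n_control).
    diff = z_treated[:, None, :] - z_control[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return np.argmin(d2, axis=1)


def att_from_matches(y_treated, y_control, match_idx):
    """Mean difference between treated outcomes and matched control outcomes."""
    return float(np.mean(y_treated - y_control[match_idx]))


# Simulated stand-ins for latent representations (NOT model outputs).
rng = np.random.default_rng(0)
z_control = rng.normal(size=(200, 4))
z_treated = rng.normal(loc=0.3, size=(50, 4))

# Toy outcomes: linear in the latent vector plus a constant effect of 2.0.
w = np.array([1.0, -0.5, 0.25, 0.0])
y_control = z_control @ w
y_treated = z_treated @ w + 2.0

match_idx = match_controls(z_treated, z_control)
att_estimate = att_from_matches(y_treated, y_control, match_idx)
print(f"Estimated ATT on toy data: {att_estimate:.2f}")
```

The estimate deviates from 2.0 only through imperfect matches, which illustrates why a low-dimensional, predictive latent space is attractive for matching: closer matches in that space mean smaller residual outcome differences.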

Paper Structure (79 sections, 46 equations, 5 figures, 8 tables)

This paper contains 79 sections, 46 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview. For the task of treatment effect estimation from single-arm trial data, patient information for a control group has to be extracted from electronic health records that have been collected, e.g., during hospital visits. Because the two groups come from different sources, their covariate distributions only partially overlap. Our model maps them into group-specific latent spaces and a shared, identifiable predictive space. This low-dimensional representation can then be used for treatment effect estimation. If outcome information is available for both groups, we can obtain a direct estimate of the effect from the potential outcomes ($Y(0)$, $Y(1)$). We introduce an additional task where treatment outcome information is available only for the control group. In this scenario our method estimates the treatment effect via patient matching.
  • Figure 2: Plate Diagram of Our Model. Black solid arrows denote the generative model, red dashed arrows the inferential dependency. Empty, partially filled, and filled circles refer to latent, partially, and fully observed variables.
  • Figure 3: Missingness. Moving from a scenario where all covariates are observed to one with structured missingness that is not modeled reduces performance, as expected. Modeling the MNAR pattern as part of our generative model reliably improves performance in all of our variants. Visualized is the RMSE of within-sample CATE estimation in the all+high scenario. Shown are 100 random replications along with their respective means ($\blacklozenge$).
  • Figure 4: Survival. Kaplan-Meier estimates of survival curves for three of the models for a single random seed. During training, survival information for the treatment group is unavailable and we observe survival times only for the control group. After patient matching, the matched subset is compared against the unknown, counterfactual survival curve of the single-arm group. The shaded areas are the 95% confidence intervals of the respective Kaplan-Meier estimators. While the match found by CFRNet overlaps completely with the unmatched control group, TARNet gets close to the desired survival curve, and our model largely overlaps with it.
  • Figure 5: Missing not at random model. Empty, partially filled, and filled circles represent latent, partially observed, and observed variables. $\tilde{\mathbf{x}}$ is deterministic as described in the text.
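
The Kaplan-Meier curves mentioned in the Figure 4 caption can be illustrated with a small sketch of the standard product-limit estimator. This is a generic textbook version written for illustration, not code from the paper; the function name `kaplan_meier` and the toy data are assumptions, and confidence intervals are omitted.

```python
import numpy as np


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    Returns the distinct event times and the survival probability
    immediately after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)

    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # still under observation at t
        n_events = np.sum((times == t) & events)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Toy data: five patients, two of them censored (event flag 0).
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
km_times, km_surv = kaplan_meier(toy_times, toy_events)
print(km_times, km_surv)
```

In the matching setting described in the caption, such a curve would be computed on the matched control subset and compared with the curve of the single-arm group.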