Comparing methods to assess treatment effect heterogeneity in general parametric regression models

Yao Chen, Sophie Sun, Konstantinos Sechidis, Cong Zhang, Torsten Hothorn, Björn Bornkamp

TL;DR

This paper addresses treatment effect heterogeneity (TEH) in general parametric regression models for randomized trials, emphasizing that TEH estimands depend on model specification and outcome scale. It centers on score-residual methods, which use the score residual of the treatment effect, $s_i = \left.\frac{\partial}{\partial \delta} \Psi(\widehat{\boldsymbol{\beta}}, \delta \mid y_i, \mathbf{x}_i)\right|_{\delta=\widehat{\delta}}$, together with a centered treatment indicator $\tilde{z}_i = z_i - E(z_i \mid \mathbf{x}_i)$, to form permutation-based global TEH tests and residual-based variable importance analyses. The methodology is extended to count and time-to-event outcomes and compared with standard global tests (asymptotic and bootstrap likelihood ratio tests, Goeman's global test) and model-based recursive partitioning (MOB). Through extensive simulations (continuous, binary, count, and time-to-event outcomes) and an applied cardiovascular time-to-event example, the authors find that score-residual-based methods, especially with a centered indicator, offer practical, flexible, and reliable TEH assessment, while traditional LR tests may be poorly calibrated in some settings; MOB and Goeman's test are competitive depending on the scenario. The study yields practical guidance for TEH exploration in regulatory contexts, including recommendations on centering, the choice of test statistic (maximum vs. quadratic), and penalized estimation of prognostic effects to stabilize estimands and enhance robustness.

Abstract

This paper reviews and compares methods to assess treatment effect heterogeneity in the context of parametric regression models. These methods include the standard likelihood ratio tests, bootstrap likelihood ratio tests, and Goeman's global test motivated by testing whether the random effect variance is zero. We place particular emphasis on tests based on the score-residual of the treatment effect and explore different variants of tests in this class. All approaches are compared in a simulation study, and the approach based on residual scores is illustrated in a clinical trial with time-to-event outcome comparing treatment versus placebo. Our findings demonstrate that score-residual based methods provide practical, flexible and reliable tools for exploring treatment effect heterogeneity and treatment effect modifiers, and can provide useful guidance for decision making around treatment effect heterogeneity.
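To make the score-residual idea concrete, the following is a minimal sketch for the special case of a Gaussian linear model. It is not the authors' implementation. The function names `score_residuals` and `permutation_teh_test`, the simulated data, the randomization probability of 0.5, and the maximum-covariance statistic are all assumptions chosen for illustration. In this model the derivative of the log-likelihood with respect to the treatment effect $\delta$ is, up to sign and a constant factor, the fitted residual multiplied by the centered treatment indicator.

```python
import numpy as np


def score_residuals(y, X, z, p_treat=0.5):
    """Score residuals for the treatment effect in a Gaussian linear model.

    Assumption: in a randomized trial E(z | x) equals the known
    randomization probability p_treat, so the centered indicator is
    z_tilde = z - p_treat.
    """
    z_tilde = z - p_treat
    design = np.column_stack([np.ones(len(y)), X, z_tilde])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    # d/d(delta) of the Gaussian log-likelihood, dropping the 1 / sigma^2 factor.
    return resid * z_tilde


def permutation_teh_test(s, X, n_perm=999, seed=0):
    """Permutation p-value for association between score residuals and covariates."""
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each covariate

    def stat(s_vec):
        # Maximum-type statistic: largest absolute covariance with any covariate.
        return np.max(np.abs(Xc.T @ (s_vec - s_vec.mean())))

    observed = stat(s)
    perms = np.array([stat(rng.permutation(s)) for _ in range(n_perm)])
    return (1 + np.sum(perms >= observed)) / (1 + n_perm)


# Hypothetical simulated trial: the treatment effect varies with the second covariate.
rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 3))
z = rng.binomial(1, 0.5, size=n).astype(float)
y = X[:, 0] + z * (1.0 + 1.5 * X[:, 1]) + rng.normal(size=n)

s = score_residuals(y, X, z)
p_value = permutation_teh_test(s, X)
print(f"global TEH permutation p-value: {p_value:.3f}")
```

A quadratic-type statistic could be swapped in by replacing the maximum with a sum of squared covariances. For count or time-to-event outcomes the residual would come from the corresponding model's score function rather than from least squares.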

Paper Structure

This paper contains 28 sections, 8 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Overview of WATCH workflow and the four main steps.
  • Figure 2: Plot of score residuals for models fitted to two data sets versus the covariate $x_1$, corresponding to situations of no TEH and heterogeneous treatment effects. Models M1-M6 are shown in their corresponding columns. Kendall's correlation between $x_1$ and the score residual, as well as the $p$-value for a global heterogeneity test, are shown in the top left. The red points (treatment = 0) and blue points (treatment = 1) are the score residuals for the two treatment groups, and the gray curve is a non-linear fit of the score residuals versus $x_1$.
  • Figure 3: ECDF of $p$-values (F(p-value)) for each method under the null hypothesis, i.e. no heterogeneity. A well-calibrated method should have $p$-values uniformly distributed on [0, 1], so that its ECDF follows the diagonal.
  • Figure 4: Median surprise value ($-\log_2$($p$-value)) under various degrees of treatment effect heterogeneity; a larger median surprise value corresponds to greater power to detect treatment effect heterogeneity. The black dashed line marks a surprise value of 1, which corresponds to a $p$-value of 0.5.
  • Figure 5: Probability that the top-ranked variable from the variable importance analysis is predictive.
  • ...and 10 more figures
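
The two summaries used in Figures 3 and 4, the ECDF of $p$-values under the null and the surprise value $-\log_2(p)$, can be sketched as follows. This is an illustrative example only: the helper names `surprise_value` and `ecdf` are hypothetical, and the $p$-values are simulated from a uniform distribution rather than taken from the paper.

```python
import numpy as np


def surprise_value(p):
    """Surprise value in bits: -log2(p)."""
    return -np.log2(np.asarray(p, dtype=float))


def ecdf(p_values, grid):
    """Fraction of p-values at or below each grid point."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    return np.searchsorted(p_sorted, grid, side="right") / len(p_sorted)


# Simulated p-values from a perfectly calibrated test (assumption for illustration).
rng = np.random.default_rng(1)
null_p = rng.uniform(size=5000)

grid = np.linspace(0.0, 1.0, 11)
# For a calibrated test the ECDF stays close to the diagonal F(p) = p.
max_gap = np.max(np.abs(ecdf(null_p, grid) - grid))

print("surprise value at p = 0.5:", surprise_value(0.5))
print("median surprise value under the null:", np.median(surprise_value(null_p)))
print("largest ECDF deviation from the diagonal:", max_gap)
```

Under the null the median $p$-value is close to 0.5, so the median surprise value is close to 1; this is why a surprise value of 1 serves as the reference line.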