Comparing methods to assess treatment effect heterogeneity in general parametric regression models
Yao Chen, Sophie Sun, Konstantinos Sechidis, Cong Zhang, Torsten Hothorn, Björn Bornkamp
TL;DR
This paper addresses treatment effect heterogeneity (TEH) within general regression models for randomized trials, emphasizing that TEH estimands depend on model specification and outcome scale. It centers on score-residual methods, deriving $s_i = \left.\frac{\partial}{\partial \delta} \Psi(\widehat{\boldsymbol{\beta}}, \delta \mid y_i, \mathbf{x}_i)\right|_{\delta=\widehat{\delta}}$ and employing a centered treatment indicator $\tilde{z}_i = z_i - E(z_i \mid \mathbf{x}_i)$ to form permutation-based global TEH tests, as well as variable importance analyses based on the score residuals. The methodology is extended to count and time-to-event outcomes, with comparisons to standard global tests (asymptotic and bootstrap likelihood ratio tests, Goeman's global test) and MOB partitioning. Through extensive simulations (continuous, binary, count, and time-to-event outcomes) and an applied cardiovascular time-to-event example, the authors find that score-residual-based methods, especially with a centered indicator, offer practical, flexible, and reliable TEH assessment, while traditional likelihood ratio tests may be poorly calibrated in some settings; MOB and Goeman's test are competitive in some scenarios. The study yields practical guidance for TEH exploration in regulatory contexts, including recommendations on centering, test statistics (maximum vs quadratic), and penalized estimation of prognostic effects to stabilize estimands and enhance robustness.
Abstract
This paper reviews and compares methods to assess treatment effect heterogeneity in the context of parametric regression models. These methods include the standard likelihood ratio tests, bootstrap likelihood ratio tests, and Goeman's global test, which is motivated by testing whether the random effect variance is zero. We place particular emphasis on tests based on the score residual of the treatment effect and explore different variants of tests in this class. All approaches are compared in a simulation study, and the approach based on score residuals is illustrated in a clinical trial with a time-to-event outcome comparing treatment versus placebo. Our findings demonstrate that score-residual-based methods provide practical, flexible and reliable tools for exploring treatment effect heterogeneity and treatment effect modifiers, and can provide useful guidance for decision making around treatment effect heterogeneity.
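To make the score-residual idea concrete, the following is a minimal sketch for the simplest case, a Gaussian linear model, and is not the authors' implementation. It assumes that $\Psi$ is the per-patient log-likelihood contribution, so that the score residual reduces to $s_i = \tilde{z}_i (y_i - \widehat{\eta}_i)/\widehat{\sigma}^2$, and that $E(z_i \mid \mathbf{x}_i)$ equals the known allocation probability of the randomized trial. The function names `score_residuals_gaussian` and `max_type_permutation_test`, and the particular maximum-type statistic (largest absolute covariance between the score residuals and the standardized covariates), are illustrative choices rather than names or definitions taken from the paper.

```python
import numpy as np


def score_residuals_gaussian(y, X, z, p_treat=None):
    """Score residuals of the treatment effect in a Gaussian linear model.

    Assumed working model: y = b0 + X @ beta + delta * z_c + noise,
    where z_c is the centered treatment indicator.
    """
    n = len(y)
    # E(z | x): in a randomized trial this is the allocation probability.
    # If it is not supplied, fall back to the observed proportion (assumption).
    if p_treat is None:
        p_treat = z.mean()
    z_c = z - p_treat
    # Fit the model without any treatment-by-covariate interaction.
    design = np.column_stack([np.ones(n), X, z_c])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - design.shape[1])
    # d/d(delta) of the Gaussian log-likelihood contribution, at the estimate.
    return z_c * resid / sigma2


def max_type_permutation_test(s, X, n_perm=2000, seed=0):
    """Global test of association between score residuals and covariates.

    Statistic: largest absolute covariance-type sum between s and each
    standardized covariate; the null distribution is obtained by permuting s.
    Covariate columns are assumed to be non-constant.
    """
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)

    def stat(s_vec):
        return np.max(np.abs(Xc.T @ (s_vec - s_vec.mean())))

    observed = stat(s)
    perm = np.array([stat(rng.permutation(s)) for _ in range(n_perm)])
    p_value = (1 + np.sum(perm >= observed)) / (n_perm + 1)
    return observed, p_value
```

With a covariate matrix `X` of shape `(n, p)`, a 0/1 treatment vector `z`, and an outcome `y`, calling `s = score_residuals_gaussian(y, X, z, p_treat=0.5)` followed by `max_type_permutation_test(s, X)` returns the observed statistic and a permutation p-value. A quadratic-type statistic would replace the maximum with a sum of squares over covariates; for binary, count, or time-to-event outcomes the residual term would be replaced by the corresponding model's score contribution.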
