Efficient Subgroup Analysis via Optimal Trees with Global Parameter Fusion
Zhongming Xie, Joseph Giorgio, Jingshen Wang
TL;DR
The paper addresses heterogeneous treatment effects in clinical trials by introducing a fused optimal causal tree (FOCT), which uses mixed-integer optimization to obtain globally optimal subgroup partitions while enforcing parameter fusion across related subgroups. The fusion constraint promotes information sharing and improves statistical efficiency, addressing the instability and overfitting that affect greedy tree methods, especially with small samples and rare alleles. The authors derive theoretical risk bounds showing near-optimal convergence for FOCT, compare them with those of CART, and report superior performance in simulations across varying correlations and sample sizes. A case study on the Health and Aging Brain Study–Health Disparities (HABS-HD) dataset shows FOCT yielding clinically meaningful subgroup insights and interpretable covariate effects, illustrating its potential for precision medicine in Alzheimer's disease (AD).
Abstract
Identifying and making statistical inferences on differential treatment effects (commonly known as subgroup analysis in clinical research) is central to precision health. Subgroup analysis allows practitioners to pinpoint populations for whom a treatment is especially beneficial or protective, thereby advancing targeted interventions. Tree-based recursive partitioning methods are widely used for subgroup analysis due to their interpretability. Nevertheless, these approaches have significant limitations, including suboptimal partitions induced by greedy heuristics and overfitting from locally estimated splits, especially under limited sample sizes. To address these limitations, we propose a fused optimal causal tree method that leverages mixed-integer optimization (MIO) to facilitate precise subgroup identification. Our approach ensures globally optimal partitions and introduces a parameter fusion constraint to facilitate information sharing across related subgroups. This design substantially improves subgroup discovery accuracy and enhances statistical efficiency. We provide theoretical guarantees by rigorously establishing out-of-sample risk bounds and comparing them with those of classical tree-based methods. Empirically, our method consistently outperforms popular baselines in simulations. Finally, we demonstrate its practical utility through a case study on the Health and Aging Brain Study–Health Disparities (HABS-HD) dataset, where our approach yields clinically meaningful insights.
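To make the idea of parameter fusion concrete, the following is a minimal illustrative sketch and not the paper's MIO formulation. It assumes a single split (a depth-one tree), replaces the mixed-integer solver with an exhaustive search over candidate thresholds (which is globally optimal only for this one-split toy case), and interprets fusion as a covariate slope shared across both leaves while each leaf keeps its own intercept and treatment effect. The function name fit_fused_stump, the simulated data, and all variable names are hypothetical.

```python
import numpy as np


def fit_fused_stump(x_split, z, t, y, thresholds, min_leaf=10):
    """Search one split exhaustively (assumed stand-in for a global solver).

    Each leaf has its own intercept and treatment effect; the slope on the
    covariate z is shared (fused) across the two leaves.
    """
    best = None
    for c in thresholds:
        left = x_split <= c
        right = ~left
        if left.sum() < min_leaf or right.sum() < min_leaf:
            continue
        # Columns: leaf intercepts, leaf-specific treatment effects, fused slope.
        X = np.column_stack([left, right, left * t, right * t, z]).astype(float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        if best is None or sse < best["sse"]:
            best = {
                "threshold": float(c),
                "sse": sse,
                "tau_left": float(beta[2]),
                "tau_right": float(beta[3]),
                "fused_slope": float(beta[4]),
            }
    return best


# Simulated toy data (assumption): the treatment helps only when x > 0,
# and the covariate z has the same slope (0.5) in both subgroups.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-1.0, 1.0, size=n)
z = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * z + 2.0 * t * (x > 0) + rng.normal(scale=0.5, size=n)

best = fit_fused_stump(x, z, t, y, thresholds=np.linspace(-0.8, 0.8, 33))
print(best)
```

Because the slope on z is estimated from all observations rather than separately within each leaf, both subgroups contribute to it, which is the information-sharing effect the abstract attributes to the fusion constraint. A full implementation would encode deeper trees and the fusion constraint inside a mixed-integer program instead of enumerating thresholds.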
