Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

Yating Zou, Marcos Matabuena, Michael R. Kosorok

TL;DR

This work addresses estimation of the full conditional distribution $P(Y\mid X=x)$ of a multivariate outcome under complex survey designs. It introduces the survey-calibrated distributional random forest (SDRF), which combines survey weights, a pseudo-population bootstrap, PSU-level honesty, and an MMD-based split criterion built on kernel mean embeddings, so that the forest targets the entire conditional law rather than only its moments. The authors establish design-consistency and model-consistency results, derive a two-stage theoretical framework for the estimator, and assess finite-sample performance through simulations and an NHANES-based case study that reveals distributional heterogeneity across subpopulations. The approach supports distributional risk profiling and subgroup-specific public-health decisions in nationally representative studies, and the authors provide practical guidance along with data and code for replication.

Abstract

We study estimation of the conditional law $P(Y\mid X=x)$ and continuous functionals $\Psi(P(Y\mid X=x))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs, and we establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design illustrate finite-sample performance and quantify the statistical cost of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: we estimate tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.
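
To make the split criterion more concrete, the following is a minimal sketch of an MMD statistic between two candidate child nodes whose distributions are Hájek-type, that is, survey weights normalized to sum to one within each node. It is an illustration under stated assumptions and not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the function names, and the weight-share scaling in `split_score` (modeled on the sample-share scaling commonly used for unweighted distributional forests) are all hypothetical choices made here.

```python
import numpy as np


def gaussian_gram(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between rows of A (n, d) and B (m, d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def hajek_weights(w):
    """Normalize survey weights so they sum to one (Hajek-type weights)."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()


def weighted_mmd2(Y_left, w_left, Y_right, w_right, bandwidth=1.0):
    """Squared MMD between the weighted empirical laws of two nodes.

    Equals the squared RKHS distance between the two weighted kernel
    mean embeddings (biased, V-statistic form), so it is non-negative
    up to floating-point error.
    """
    a = hajek_weights(w_left)
    b = hajek_weights(w_right)
    k_ll = a @ gaussian_gram(Y_left, Y_left, bandwidth) @ a
    k_rr = b @ gaussian_gram(Y_right, Y_right, bandwidth) @ b
    k_lr = a @ gaussian_gram(Y_left, Y_right, bandwidth) @ b
    return k_ll + k_rr - 2.0 * k_lr


def split_score(Y_left, w_left, Y_right, w_right, bandwidth=1.0):
    """Assumed scoring rule: weight-share product times squared MMD."""
    W_l, W_r = float(np.sum(w_left)), float(np.sum(w_right))
    share = (W_l * W_r) / (W_l + W_r) ** 2
    return share * weighted_mmd2(Y_left, w_left, Y_right, w_right, bandwidth)


# Toy usage: two bivariate outcome clouds with unequal survey weights.
rng = np.random.default_rng(0)
Y_l = rng.normal(0.0, 1.0, size=(40, 2))
Y_r = rng.normal(1.5, 1.0, size=(60, 2))
w_l = rng.uniform(1.0, 5.0, size=40)
w_r = rng.uniform(1.0, 5.0, size=60)
score = split_score(Y_l, w_l, Y_r, w_r)
```

Because the weights are normalized within each node, rescaling all weights in a node by a constant leaves `weighted_mmd2` unchanged; only the weight-share factor in `split_score` reacts to the relative size of the two nodes.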

Paper Structure

This paper contains 29 sections, 17 theorems, 258 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Lemma 2.1

Under (D2)-(D4), we have that with respect to the design law $P_{\mathcal{S}^N|\omega}$, $\frac{1}{N}\sum_{i=1}^N\!\left(\frac{\xi_i}{\pi_i}-1\right) \xrightarrow{p_d} 0$, and $\frac{1}{\sqrt{N}}\sum_{i=1}^N\!\left(\frac{\xi_i}{\pi_i}-1\right) = O_{p_d}(1).$
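
The two statements of the lemma can be checked numerically in a simple special case. The sketch below assumes Poisson sampling (independent Bernoulli inclusion indicators $\xi_i$ with probabilities $\pi_i$ bounded away from zero), which is an illustrative design chosen here and not necessarily the setting covered by (D2)-(D4); the function name and the range of $\pi_i$ are likewise hypothetical.

```python
import numpy as np


def ht_deviation(N, rng, pi_low=0.1, pi_high=0.5):
    """Simulate one Poisson sample and return the two lemma quantities.

    Returns (mean, scaled), where
      mean   = (1/N)       * sum_i (xi_i / pi_i - 1)
      scaled = (1/sqrt(N)) * sum_i (xi_i / pi_i - 1)
    """
    pi = rng.uniform(pi_low, pi_high, size=N)  # inclusion probabilities
    xi = rng.random(N) < pi                    # independent inclusion indicators
    dev = xi / pi - 1.0                        # mean zero under the design
    return dev.mean(), dev.sum() / np.sqrt(N)


rng = np.random.default_rng(0)
results = {N: ht_deviation(N, rng) for N in (10**3, 10**4, 10**5)}
for N, (mean_dev, scaled_dev) in results.items():
    # The first column should shrink toward 0 as N grows; the second
    # should stay of order one rather than diverge.
    print(f"N={N:>6d}  mean={mean_dev:+.5f}  sqrt-scaled={scaled_dev:+.3f}")
```

Under this assumed design each term $\xi_i/\pi_i - 1$ has mean zero and variance $(1-\pi_i)/\pi_i$, so the averaged sum concentrates around zero while the $\sqrt{N}$-scaled sum remains bounded in probability, matching the two conclusions of the lemma.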

Figures (4)

  • Figure 1: Pointwise mean square error (MSE) of the estimator $\widehat{\mathbb{E}}[Y_1|X_1 = \mathbf{x}_1]$ from SDRF and DRF. Each point represents an average over 200 seeds.
  • Figure 2: Pointwise standard deviation (SD) of the estimator $\widehat{\mathbb{E}}[Y_1|X_1 = \mathbf{x}_1]$ from SDRF and DRF. Each point represents an average over 200 seeds.
  • Figure 3: SDRF--estimated tolerance regions (in expectation) for $\log(\mathrm{FPG})$ vs. $\mathrm{HbA1c}$ at quantile levels $\alpha \in \{0.10, 0.25, 0.50, 0.75, 0.90\}$. Panels display regions conditional on specific age groups (top two) and BMI ranges (bottom three). Green and gray points indicate observations falling inside and outside the estimated regions, respectively.
  • Figure 4: SDRF--estimated tolerance regions (in expectation) for $\log(\mathrm{FPG})$ vs. $\mathrm{HbA1c}$ stratified by gender. The contours correspond to $\alpha \in \{0.10, 0.25, 0.50, 0.75, 0.90\}$. Green and gray points indicate observations falling inside and outside the estimated regions, respectively.

Theorems & Definitions (45)

  • Remark
  • Lemma 2.1
  • Remark
  • Lemma 3.1
  • Remark
  • Theorem 3.2: Local design/model consistency of the MMD split
  • Remark
  • Proposition 3.3: Effect of averaging $M$ resampling draws on split-score stability
  • Remark
  • Theorem 3.4
  • ...and 35 more