Empirical Bayes Predictive Density Estimation under Covariate Shift in Large Imbalanced Linear Mixed Models

Abir Sarkar, Gourab Mukherjee, Keisuke Yano

Abstract

We study empirical Bayes (EB) predictive density estimation in linear mixed models (LMMs) with a large number of units, which induces a high-dimensional random-effects space. Focusing on Kullback-Leibler (KL) risk minimization, we develop a calibration framework to optimally tune predictive densities derived from a broad class of flexible priors. Our proposed method addresses two key challenges in predictive inference: (a) severe data scarcity leading to highly imbalanced designs, in which replicates are available for only a small subset of units; and (b) distributional shifts in future covariates. To estimate predictive KL risk in LMMs, we use a data-fission approach that leverages exchangeability in the covariate distribution. We establish convergence rates for our proposed risk estimators and show how their efficiency deteriorates as data scarcity increases. Our results imply the decision-theoretic optimality of the proposed EB predictive density estimator. The theoretical development relies on a novel probabilistic analysis of the interaction between data fission, sample reuse, and the predictive heat-equation representation of George et al. (2006), which expresses predictive KL risk through expected log-marginals. Extensive simulation studies demonstrate strong predictive performance and robustness of the proposed approach across diverse regimes with varying degrees of data scarcity and covariate shift.
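The abstract does not spell out the fission scheme, so the following is only a minimal sketch of the standard Gaussian data-fission construction, not the authors' procedure. It assumes a single observation $x \sim N(\mu, \sigma^2)$ with known $\sigma$ and splits it into two independent Gaussian pieces by adding and subtracting external noise. The function names `gaussian_fission` and `fission_check` and the tuning parameter `tau` are hypothetical, introduced for illustration; how the paper combines fission with covariate exchangeability in LMMs is not shown here.

```python
import random
import statistics


def gaussian_fission(x, sigma, tau, rng):
    """Split one draw x ~ N(mu, sigma^2) into two independent Gaussian pieces.

    Assumes sigma is known and tau > 0 (tau is a hypothetical tuning knob).
    """
    z = rng.gauss(0.0, sigma)  # external noise with the same scale as x
    part_a = x + tau * z       # ~ N(mu, (1 + tau^2) * sigma^2)
    part_b = x - z / tau       # ~ N(mu, (1 + 1/tau^2) * sigma^2)
    return part_a, part_b


def fission_check(mu=2.0, sigma=1.5, tau=1.0, n=50_000, seed=0):
    """Monte Carlo check: the two pieces should be (nearly) uncorrelated."""
    rng = random.Random(seed)
    pairs = [gaussian_fission(rng.gauss(mu, sigma), sigma, tau, rng)
             for _ in range(n)]
    a = [p[0] for p in pairs]
    b = [p[1] for p in pairs]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Sample covariance between the two pieces.
    cov = sum((u - mean_a) * (v - mean_b) for u, v in pairs) / (n - 1)
    return {
        "var_a": statistics.variance(a),
        "var_b": statistics.variance(b),
        "cov": cov,
    }


if __name__ == "__main__":
    print(fission_check())
```

The covariance of the two pieces is $\sigma^2 - \tau \cdot \tau^{-1} \sigma^2 = 0$, and since they are jointly Gaussian this gives independence. One piece can then play the role of training data and the other the role of held-out data for risk estimation, at the cost of inflated variance in each.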

Paper Structure

This paper contains 23 sections, 12 theorems, 254 equations, 3 figures, 1 table.

Key Result

Theorem 1

For any fixed $\bm{\beta}$ and $\sigma>0$, under assumptions assumption1-assumption3 for some $\alpha \in [0,1/2]$, the risk of $\hat{p}[\hat{\bm{\beta}},\hat{\sigma}, g]$ is given by a display (missing from this extract) in which the expectations are taken under the model eq:model.11-eq:model.12 and the functions $m_g$ and $\tilde{m}_g$ are as defined in m.defined.1-m.defined.2.
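As background for the role of log-marginals in the risk, the block below recalls the KL predictive risk and the identity of George et al. (2006) in the simpler homoscedastic normal model. It is not the display of Theorem 1. The setting $X \sim N_p(\bm{\mu}, v_x I)$, $Y \sim N_p(\bm{\mu}, v_y I)$ and the symbols $\pi$, $m_\pi$, $\hat{p}_U$, $v_w$ are those of that simpler model and are assumptions here; they need not match the paper's $m_g$ and $\tilde{m}_g$.

```latex
% KL risk of a predictive density \hat{p}(\cdot \mid x) at parameter \bm{\mu}
R_{\mathrm{KL}}(\bm{\mu}, \hat{p})
  = \mathbb{E}_{\bm{\mu}} \int p(y \mid \bm{\mu})
    \log \frac{p(y \mid \bm{\mu})}{\hat{p}(y \mid X)} \, dy .

% Risk difference between the uniform-prior predictive density \hat{p}_U
% and the Bayes predictive density \hat{p}_\pi under a prior \pi,
% with m_\pi(\cdot\,; v) the marginal of N_p(\bm{\mu}, v I) mixed over \pi
R_{\mathrm{KL}}(\bm{\mu}, \hat{p}_U) - R_{\mathrm{KL}}(\bm{\mu}, \hat{p}_\pi)
  = \mathbb{E}_{\bm{\mu}, v_w} \log m_\pi(W; v_w)
  - \mathbb{E}_{\bm{\mu}, v_x} \log m_\pi(X; v_x),
\qquad
v_w = \frac{v_x v_y}{v_x + v_y}, \quad W \sim N_p(\bm{\mu}, v_w I).
```

The point of the identity is that predictive KL risk can be compared through expected log-marginals evaluated at two variance levels, $v_x$ and $v_w < v_x$, which is the structure that a risk estimator based on two marginal functions would target.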

Figures (3)

  • Figure 1: Plot of $D_n(h_n)$ across the six regimes A--F (ordered from top left by row) as a function of $n$. The curves correspond to different choices of $h_n$: blue ($h_n=0$), orange ($h_n=\log n$), green ($h_n=n^{1/4}$), and red ($h_n=n^{1/2}$). The dotted purple lines represent the growth rates implied by the asymptotic theory.
  • Figure 2: Plot of the improvement factor $\textsf{IF}_n$ (expressed as fractions) across the six regimes A--F (ordered from top left by row) as a function of $n$. The curves correspond to different choices of $h_n$: blue ($h_n=0$), orange ($h_n=\log n$), green ($h_n=n^{1/4}$), and red ($h_n=n^{1/2}$).
  • Figure 3: Plot of the excess KL risk of the competing predictive density estimators (prdes) relative to the Bayes benchmark across the six regimes A--F (ordered from top left by row) as a function of $n$. The curves correspond to prdes for different $h_n$: blue ($h_n=0$), orange ($h_n=\log n$), green ($h_n=n^{1/4}$), and red ($h_n=n^{1/2}$), together with black (plugin: $g$-modeling) and purple (naive plugin).

Theorems & Definitions (15)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Corollary 1
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 5 more