Efficient Estimation of the Central Mean Subspace via Smoothed Gradient Outer Products

Gan Yuan, Mingyue Xu, Samory Kpotufe, Daniel Hsu

Abstract

We consider the problem of sufficient dimension reduction (SDR) for multi-index models. The estimators of the central mean subspace in prior works either have slow (non-parametric) convergence rates, or rely on stringent distributional conditions (e.g., the covariate distribution $P_{\mathbf{X}}$ being elliptically symmetric). In this paper, we show that a fast parametric convergence rate of the form $C_d \cdot n^{-1/2}$ is achievable via estimating the \emph{expected smoothed gradient outer product}, for a general class of distributions $P_{\mathbf{X}}$ admitting Gaussian or heavier-tailed distributions. When the link function is a polynomial of degree at most $r$ and $P_{\mathbf{X}}$ is the standard Gaussian, we show that the prefactor depends on the ambient dimension $d$ as $C_d \propto d^r$.
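
To make the idea of a smoothed gradient outer product concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a standard Gaussian design and uses the standard identity that the gradient of a Gaussian-smoothed regression function at a point $\theta$ equals $h^{-2}\,\mathbb{E}[Y\, w(\boldsymbol{X})\,(\boldsymbol{X}-\theta)]$, where $w$ is the density ratio $\mathcal{N}(\theta, h^2\boldsymbol{I}_d)/\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_d)$. The function names, the toy link function, the parameter values, and the reuse of the full sample at every anchor point are all illustrative assumptions.

```python
import numpy as np


def smoothed_gradient(X, Y, theta, h):
    """Monte Carlo estimate of the gradient at `theta` of the Gaussian-smoothed
    regression function. Assumes the rows of X are standard Gaussian."""
    n, d = X.shape
    diff = X - theta  # shape (n, d)
    # Log density ratio N(theta, h^2 I) / N(0, I), evaluated at each sample.
    log_w = (-0.5 * np.sum(diff**2, axis=1) / h**2
             + 0.5 * np.sum(X**2, axis=1)
             - d * np.log(h))
    w = np.exp(log_w)
    # Stein-type identity: gradient = E[Y * w(X) * (X - theta)] / h^2.
    return (Y * w) @ diff / (n * h**2)


def estimate_subspace(X, Y, k, h=1.0, sigma_theta=0.3, m=20, seed=0):
    """Average gradient outer products over m random anchors and return the
    top-k eigenvectors (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    thetas = sigma_theta * rng.standard_normal((m, d))
    G = np.stack([smoothed_gradient(X, Y, t, h) for t in thetas])  # (m, d)
    M = G.T @ G / m  # averaged outer product of the estimated gradients
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, -k:]


# Toy multi-index model: Y depends on X only through its first two coordinates.
rng = np.random.default_rng(1)
n, d, k = 100_000, 5, 2
U = np.eye(d)[:, :k]
X = rng.standard_normal((n, d))
Z = X @ U
Y = Z[:, 0] ** 2 + Z[:, 1] + 0.1 * rng.standard_normal(n)

U_hat = estimate_subspace(X, Y, k)
# Spectral norm of the difference of projections (sine of the largest principal angle).
err = np.linalg.norm(U_hat @ U_hat.T - U @ U.T, ord=2)
print(f"projection distance: {err:.3f}")
```

Averaging over several anchors matters in this toy example: at $\theta = \boldsymbol{0}$ the smoothed gradient of the quadratic component vanishes, so a single anchor would recover only one of the two directions.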

Paper Structure (31 sections, 15 theorems, 69 equations, 3 figures, 1 table, 2 algorithms)

Key Result

Proposition 3.2

Suppose that Assumption \ref{assume:basic} holds, and that the link function $f$ satisfies $\mathbb{E}_{\boldsymbol{Z}\sim\mathcal{N}(\boldsymbol{0}_k, h^2 \boldsymbol{I}_k)} [f(\boldsymbol{Z})^2] < \infty$. Then, for $h, \sigma_\theta > 0$, the ESGOP $\overline{\boldsymbol{M}}$ ...

Figures (3)

  • Figure 1: The subspace estimation error $d(\widehat{\boldsymbol{U}}, \boldsymbol{U})$ vs. the sampling budget $n$. Here, we fix the number of partitions $m = 15$ and $\sigma_{\theta} = h/\sqrt{20 + 10d}$. We replicate 10 times for each pair $(n,h)$ and plot the mean (dots) and the standard error (error bars) of the estimation error. When $P_{\boldsymbol{X}}$ is the standard Gaussian (left), the performance of Algorithm \ref{alg:main} is quite sensitive to the choice of $h$. The optimal choice of $h$ is around 1, the data variance. When $h$ is smaller (e.g., $h=0.5$) or larger (e.g., $h=1.2, 1.5$), we observe larger errors at the same budget. This optimum approximately coincides with the minimizer of $\mu_{\rho}$ in $h$ (cf. Proposition \ref{prop:var}). When $P_{\boldsymbol{X}}$ is the standard Cauchy, the method is more robust to the choice of $h$.
  • Figure 1: An example plot of the link function $f$ as defined in Equation \ref{eqn:save_link}. Here, we have $\{Y > y\} = [z_1, z_2]$ from the plot $\{x: y \le f(x) \}$, where $z_1 = f_0^{-1}(y)$ and $z_2 = \nu^{-1}(z_1) = \nu^{-1}(f_0^{-1}(y))$.
  • Figure 2: The subspace estimation error $d(\widehat{\boldsymbol{U}},\boldsymbol{U})$ vs. the choice of $m$. We replicate 10 times for each $m$, and plot the mean (dots) and the standard error (error bars) of the subspace estimation errors. When $m$ is small, the ASGOP $\widetilde{\boldsymbol{M}}$ is not guaranteed to be exhaustive, and only a proper subspace of $\mathcal{U}$ can be recovered. This results in a large subspace estimation error. When $m$ is larger than a certain threshold (cf. Corollary \ref{cor:exhaust}), the ASGOP $\widetilde{\boldsymbol{M}}$ is exhaustive with high probability, and the subspace error has an upper bound that grows at the rate $O(\sqrt{m})$. The result in the figure matches this reasoning: the subspace estimation error drops sharply in the regime where $m$ is small, and grows gradually for large $m$.
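
The error $d(\widehat{\boldsymbol{U}}, \boldsymbol{U})$ in the captions above is a distance with optimal rotation (Definition 2.2). As a hedged illustration, the sketch below computes one common variant, $\min_{\boldsymbol{R}} \|\widehat{\boldsymbol{U}}\boldsymbol{R} - \boldsymbol{U}\|_F$ over orthogonal $\boldsymbol{R}$, using the orthogonal Procrustes solution. The choice of the Frobenius norm and the function name are assumptions; the paper's definition may use a different norm.

```python
import numpy as np


def dist_optimal_rotation(U_hat, U):
    """Frobenius distance between two orthonormal bases after the best
    orthogonal alignment: min over orthogonal R of ||U_hat @ R - U||_F."""
    # Orthogonal Procrustes: if U_hat^T U = A diag(s) B^T, the minimizer is R = A B^T.
    A, _, Bt = np.linalg.svd(U_hat.T @ U)
    R = A @ Bt
    return np.linalg.norm(U_hat @ R - U, ord="fro")


# Two bases of the same subspace, differing by a rotation, are at distance ~0.
U = np.eye(4)[:, :2]
angle = 0.7
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle), np.cos(angle)]])
same_subspace = dist_optimal_rotation(U @ Q, U)

# Bases of orthogonal subspaces cannot be aligned by any rotation.
V = np.eye(4)[:, 2:]
orthogonal_subspace = dist_optimal_rotation(V, U)

print(same_subspace, orthogonal_subspace)
```

Because the minimization is over rotations of the basis, the distance depends only on the subspaces spanned, not on the particular bases returned by an eigensolver.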

Theorems & Definitions (42)

  • Definition 2.1: Mean dimension-reduction subspaces and central mean subspace [Cook02cms]
  • Definition 2.2: Distance with Optimal Rotation, adapted from [Chen2021spectral]
  • Definition 3.1
  • Proposition 3.2: Exhaustiveness of $\overline{\boldsymbol{M}}$
  • Definition 3.3
  • Corollary 3.4: Exhaustiveness of $\widetilde{\boldsymbol{M}}$
  • Remark 3.5
  • Lemma 3.6: Stein's Lemma [chen2011stein]
  • Proposition 3.7
  • Remark 3.8
  • ...and 32 more