Estimation of multiple mean vectors in high dimension

Gilles Blanchard, Jean-Baptiste Fermanian, Hannah Marienwald

TL;DR

The paper tackles the challenge of estimating many high-dimensional mean vectors $\boldsymbol{\mu}_k$ from independent samples by forming estimators as convex combinations of per-task empirical means. It develops two data-driven aggregation strategies: a testing-based method that selects neighbouring tasks to control bias and a Q-aggregation method that minimizes an upper confidence bound on the risk; both achieve oracle-like risk in wide ranges of high-dimensional regimes. The analysis establishes high-dimensional minimax results, showing the oracle risk is minimax optimal under certain homogeneity conditions, and demonstrates dimension-driven improvements (a blessing of dimensionality) as effective dimension grows. The methods are validated on multiple kernel mean embeddings, including artificial Gaussian data and real flow-cytometry data, where they yield substantial improvements over naive per-task estimates and competitive baselines. Together, the work advances practical, adaptive multi-task mean estimation in high dimensions, with implications for federated, personalized, and distributional learning using kernel embeddings and related representations.
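
To make the bias and variance trade-off behind convex combinations concrete, the following is a standard decomposition rather than a formula quoted from the paper. The notation is an assumption introduced here: $\hat{\boldsymbol{\mu}}_k$ is the empirical mean of task $k$ computed from $N_k$ i.i.d. samples with covariance $\Sigma_k$, the samples are independent across the $B$ tasks, and $\boldsymbol{\omega}$ is a weight vector with $\omega_k \ge 0$ and $\sum_k \omega_k = 1$. The target is taken to be task $1$.

```latex
\mathbb{E}\Bigl\| \sum_{k=1}^{B} \omega_k \hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}_1 \Bigr\|^2
  = \underbrace{\Bigl\| \sum_{k=1}^{B} \omega_k \bigl(\boldsymbol{\mu}_k - \boldsymbol{\mu}_1\bigr) \Bigr\|^2}_{\text{squared bias}}
  + \underbrace{\sum_{k=1}^{B} \omega_k^2 \, \frac{\operatorname{Tr}(\Sigma_k)}{N_k}}_{\text{variance}}
```

Putting all the weight on task $1$ removes the bias but leaves the full variance $\operatorname{Tr}(\Sigma_1)/N_1$; spreading weight over tasks whose means are close to $\boldsymbol{\mu}_1$ lowers the variance term at the cost of a small bias.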

Abstract

We endeavour to estimate numerous multi-dimensional means of various probability distributions on a common space based on independent samples. Our approach involves forming estimators through convex combinations of empirical means derived from these samples. We introduce two strategies to find appropriate data-dependent convex combination weights: a first one employing a testing procedure to identify neighbouring means with low variance, which results in a closed-form plug-in formula for the weights, and a second one determining weights via minimization of an upper confidence bound on the quadratic risk. Through theoretical analysis, we evaluate the improvement in quadratic risk offered by our methods compared to the empirical means. Our analysis focuses on a dimensional asymptotics perspective, showing that our methods asymptotically approach an oracle (minimax) improvement as the effective dimension of the data increases. We demonstrate the efficacy of our methods in estimating multiple kernel mean embeddings through experiments on both simulated and real-world datasets.
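
The idea of averaging over tested neighbours can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' procedure: the function names, the debiased-distance test, the threshold parameter `tau`, and the inverse-variance weights are all hypothetical choices made here for readability.

```python
import numpy as np

def naive_means(bags):
    """Per-task empirical means, one row per bag."""
    return np.stack([bag.mean(axis=0) for bag in bags])

def variance_proxies(bags):
    """Estimate Tr(Sigma_k) / N_k, the variance of each empirical mean."""
    return np.array([bag.var(axis=0, ddof=1).sum() / len(bag) for bag in bags])

def neighbour_average(bags, tau=1.0):
    """Average each task's mean with the means of its tested neighbours.

    Hypothetical rule: task j counts as a neighbour of task k when the
    debiased squared distance between their empirical means is at most
    tau times the variance proxy of task k.
    """
    means = naive_means(bags)
    s2 = variance_proxies(bags)
    estimates = np.empty_like(means)
    for k in range(len(bags)):
        # ||m_j - m_k||^2 overshoots the true squared distance by roughly
        # s2[j] + s2[k] in expectation, so subtract that amount.
        d2 = ((means - means[k]) ** 2).sum(axis=1) - (s2 + s2[k])
        neighbours = d2 <= tau * s2[k]
        neighbours[k] = True  # a task is always its own neighbour
        # Convex weights supported on the neighbours (assumed form).
        w = np.where(neighbours, 1.0 / s2, 0.0)
        w /= w.sum()
        estimates[k] = w @ means
    return estimates

# Toy check: 20 tasks sharing one true mean, dimension 200, 30 samples each.
rng = np.random.default_rng(0)
true_mean = np.zeros(200)
bags = [rng.normal(true_mean, 1.0, size=(30, 200)) for _ in range(20)]

est = neighbour_average(bags)
err_naive = ((naive_means(bags) - true_mean) ** 2).sum(axis=1).mean()
err_agg = ((est - true_mean) ** 2).sum(axis=1).mean()
print(f"naive error: {err_naive:.3f}, aggregated error: {err_agg:.3f}")

# Two far-apart tasks: the test rejects, so each task keeps its own mean.
far_bags = [rng.normal(0.0, 1.0, size=(30, 50)),
            rng.normal(100.0, 1.0, size=(30, 50))]
far_est = neighbour_average(far_bags)
```

When all tasks share a mean, the aggregated estimate pools every bag and its error drops well below that of the per-task means; when the means are far apart, the test excludes the other task and the estimate falls back to the per-task mean.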

Paper Structure (81 sections, 33 theorems, 282 equations, 12 figures, 4 tables)

This paper contains 81 sections, 33 theorems, 282 equations, 12 figures, 4 tables.

Key Result

Lemma 1

Let $\tau > 0$ be fixed. For all $V \subseteq V_\tau$, the weights ${\bm{\omega}}_V^* \in {\mathcal{S}}_V$ that minimise the risk bound (eq:riskboundV) yield the bound [...]. The oracle weights ${\bm{\omega}}^*_V$ are given by: [...]

Figures (12)

  • Figure 1: Decrease in average quadratic estimation error relative to NE, in percent, on Gaussian data settings (a) and (b), respectively. Higher is better. The hashed histogram bars in (b) show the bag sizes for bags $1$ to $50$, which vary between $10$ and $300$ (right axis).
  • Figure 2: Decrease in estimation error compared to NE in percent on the flow cytometry data. Higher is better. The number next to the boxplot quantifies the median, which is also depicted as a line. The mean is visualised as a circle. From left to right: results on individual cell types $1,2,3,4,7,8,9,$ and all cell types taken jointly.
  • Figure 3: Excess relative risk for the estimation of $\mu_1$ using AGG egd with $c_q = \sqrt{\log(B)}$, $c_1=c_2 =c_{bs}=0$, as a function of the dimension. Each curve corresponds to a different value of $\delta$. Each point is the mean of $500$ realizations. Data generation is detailed above.
  • Figure 4: Example images of handwritten digits from the MNIST data set.
  • Figure 5: $V_{\tau,\varsigma}$ for the optimal values of $\tau$ and $\varsigma$ for STB opt (left) and STB egd (right) on the MNIST data set. $V_{\tau,\varsigma}$ is not necessarily symmetric, and each row corresponds to the outcome of the neighbouring test for one mean, with white indicating 'neighbours' and black 'no neighbours'.
  • ...and 7 more figures

Theorems & Definitions (49)

  • Definition 1: $\tau$-neighbouring tasks
  • Definition 2: Relative aggregated variance $\nu$
  • Definition 3
  • Lemma 1
  • Lemma 2
  • Proposition 1
  • Corollary 1
  • Proposition 2
  • Theorem 1
  • Theorem 2
  • ...and 39 more