Estimation of multiple mean vectors in high dimension
Gilles Blanchard, Jean-Baptiste Fermanian, Hannah Marienwald
TL;DR
The paper tackles the challenge of estimating many high-dimensional mean vectors $\boldsymbol{\mu}_k$ from independent samples by forming estimators as convex combinations of per-task empirical means. It develops two data-driven aggregation strategies: a testing-based method that selects neighbouring tasks to control bias and a Q-aggregation method that minimizes an upper confidence bound on the risk; both achieve oracle-like risk in wide ranges of high-dimensional regimes. The analysis establishes high-dimensional minimax results, showing the oracle risk is minimax optimal under certain homogeneity conditions, and demonstrates dimension-driven improvements (a blessing of dimensionality) as effective dimension grows. The methods are validated on multiple kernel mean embeddings, including artificial Gaussian data and real flow-cytometry data, where they yield substantial improvements over naive per-task estimates and competitive baselines. Together, the work advances practical, adaptive multi-task mean estimation in high dimensions, with implications for federated, personalized, and distributional learning using kernel embeddings and related representations.
Abstract
We endeavour to estimate numerous multi-dimensional means of various probability distributions on a common space based on independent samples. Our approach involves forming estimators through convex combinations of empirical means derived from these samples. We introduce two strategies to find appropriate data-dependent convex combination weights: a first one employing a testing procedure to identify neighbouring means with low variance, which results in a closed-form plug-in formula for the weights, and a second one determining weights via minimization of an upper confidence bound on the quadratic risk. Through theoretical analysis, we evaluate the improvement in quadratic risk offered by our methods compared to the empirical means. Our analysis focuses on a dimensional asymptotics perspective, showing that our methods asymptotically approach an oracle (minimax) improvement as the effective dimension of the data increases. We demonstrate the efficacy of our methods in estimating multiple kernel mean embeddings through experiments on both simulated and real-world datasets.
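To make the idea of a convex combination of empirical means concrete, the following is a minimal sketch and not the procedure analysed in the paper. It replaces the paper's testing procedure with a plain distance threshold and its plug-in weight formula with uniform weights over the selected neighbours. The function names, the threshold value, and the toy Gaussian data are all assumptions made for illustration.

```python
import random

def empirical_mean(sample):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n, d = len(sample), len(sample[0])
    return [sum(x[i] for x in sample) / n for i in range(d)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def neighbour_average(means, k, threshold):
    """Uniform convex combination of the empirical means lying within
    squared distance `threshold` of task k's own empirical mean.
    (Assumption: a fixed threshold stands in for a statistical test.)"""
    neighbours = [j for j, m in enumerate(means)
                  if sq_dist(means[k], m) <= threshold]
    w = 1.0 / len(neighbours)  # weights are nonnegative and sum to one
    d = len(means[k])
    estimate = [sum(w * means[j][i] for j in neighbours) for i in range(d)]
    return estimate, neighbours

# Toy data (hypothetical): tasks 0-3 share the true mean 0, tasks 4-5 have
# true mean (1, ..., 1); unit-variance Gaussian noise in dimension d.
rng = random.Random(0)
d, n = 200, 20
true_means = [[0.0] * d] * 4 + [[1.0] * d] * 2
samples = [[[mu_i + rng.gauss(0.0, 1.0) for mu_i in mu] for _ in range(n)]
           for mu in true_means]
means = [empirical_mean(s) for s in samples]

# Two empirical means with the same true mean are at expected squared
# distance 2d/n; the threshold below is twice that (an arbitrary choice).
threshold = 4 * d / n
estimate, neighbours = neighbour_average(means, 0, threshold)

err_naive = sq_dist(means[0], true_means[0])  # error of task 0's own mean
err_agg = sq_dist(estimate, true_means[0])    # error after aggregation
print(neighbours, err_naive, err_agg)
```

In this toy setting the tasks sharing a true mean are far closer to each other than to the remaining tasks, so averaging over the selected neighbours reduces variance without adding bias. The paper's methods address the harder case in which neighbouring means differ slightly and the weights must balance bias against variance.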
