Conditional Mean and Variance Estimation via \textit{k}-NN Algorithm with Automated Variance Selection
Marcos Matabuena, Juan C. Vidal, Oscar Hernan Madrid Padilla, Jukka-Pekka Onnela
TL;DR
This work addresses the challenge of nonparametric conditional distribution estimation in high dimensions by jointly estimating the conditional mean and variance using a k-NN framework augmented with a data-driven variable-selection step. The proposed VS-kNN method preserves the simplicity and scalability of k-NN while enabling accurate reconstruction of the conditional distribution and predictive intervals through mean-variance modeling and data-splitting. The authors establish consistency and convergence guarantees, provide adaptive k selection rules, and demonstrate substantial empirical gains over vanilla k-NN and GAMLSS in simulations and a large biomedical case study. The approach offers a practical, interpretable, and scalable tool for disease risk scoring and uncertainty quantification in big biomedical datasets, where the mean and variability contribute distinct information about risk.
Abstract
We introduce a novel \textit{k}-nearest neighbor (\textit{k}-NN) regression method for joint estimation of the conditional mean and variance. The proposed algorithm preserves the computational efficiency and manifold-learning capabilities of classical non-parametric \textit{k}-NN models, while integrating a data-driven variable selection step that improves empirical performance. By accurately estimating both conditional mean and variance regression functions, the method effectively reconstructs the conditional distribution and density functions for multiple families of scale-and-localization generative models. We show that our estimator can achieve fast convergence rates, and we derive practical rules for selecting the smoothing parameter~$k$ that enhance the precision of the algorithm in finite sample regimes. Extensive simulations for low, moderate and large-dimensional covariate spaces, together with a real-world biomedical application, demonstrate that the proposed method can consistently outperform the conventional \textit{k-NN} regression algorithm while being more interpretable in the model output.
