Conditional Mean and Variance Estimation via \textit{k}-NN Algorithm with Automated Variance Selection

Marcos Matabuena; Juan C. Vidal; Oscar Hernan Madrid Padilla; Jukka-Pekka Onnela

TL;DR

This work addresses the challenge of nonparametric conditional distribution estimation in high dimensions by jointly estimating the conditional mean and variance using a k-NN framework augmented with a data-driven variable-selection step. The proposed VS-kNN method preserves the simplicity and scalability of k-NN while enabling accurate reconstruction of the conditional distribution and predictive intervals through mean-variance modeling and data-splitting. The authors establish consistency and convergence guarantees, provide adaptive k selection rules, and demonstrate substantial empirical gains over vanilla k-NN and GAMLSS in simulations and a large biomedical case study. The approach offers a practical, interpretable, and scalable tool for disease risk scoring and uncertainty quantification in big biomedical datasets, where the mean and variability contribute distinct information about risk.

Abstract

We introduce a novel \textit{k}-nearest neighbor (\textit{k}-NN) regression method for joint estimation of the conditional mean and variance. The proposed algorithm preserves the computational efficiency and manifold-learning capabilities of classical non-parametric \textit{k}-NN models, while integrating a data-driven variable selection step that improves empirical performance. By accurately estimating both the conditional mean and the conditional variance regression functions, the method effectively reconstructs the conditional distribution and density functions for multiple families of scale-and-localization generative models. We show that our estimator can achieve fast convergence rates, and we derive practical rules for selecting the smoothing parameter~$k$ that enhance the precision of the algorithm in finite-sample regimes. Extensive simulations for low-, moderate-, and large-dimensional covariate spaces, together with a real-world biomedical application, demonstrate that the proposed method can consistently outperform the conventional \textit{k}-NN regression algorithm while producing more interpretable model output.
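
The joint mean and variance estimation described in the abstract can be illustrated with a minimal sketch. It assumes a plain two-way random split, Euclidean distances, and a variance estimate obtained by running \textit{k}-NN regression on squared residuals; the variable-selection step and the adaptive choice of $k$ are omitted. The function names `knn_average` and `knn_mean_variance`, and the parameters `k1`, `k2`, and `seed`, are hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_average(X_train, values, X_query, k):
    """Average `values` over the k nearest training points (Euclidean distance)."""
    # Squared distances between every query and training point: (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k smallest distances per query row (unordered within the k).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return values[nearest].mean(axis=1)


def knn_mean_variance(X, y, X_query, k1, k2, seed=0):
    """Hypothetical helper: two-split k-NN estimate of conditional mean and variance."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    first, second = perm[: len(X) // 2], perm[len(X) // 2:]
    # Mean regression fitted on the first split only (assumed splitting scheme).
    mean_hat = knn_average(X[first], y[first], X_query, k1)
    # Squared residuals on the second split, using the first-split mean fit.
    resid2 = (y[second] - knn_average(X[first], y[first], X[second], k1)) ** 2
    # Variance regression: k-NN average of the squared residuals.
    var_hat = knn_average(X[second], resid2, X_query, k2)
    return mean_hat, var_hat
```

Fitting the variance on a split that was not used for the mean keeps the residuals from being artificially small. The brute-force distance matrix is only suitable for small samples; a tree-based or approximate neighbor search would be the natural replacement at scale.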

Paper Structure

This paper contains 30 sections, 3 theorems, 33 equations, 4 figures, 25 tables.

Introduction
Summary of Contributions
Outline
Background and Related Work
Methodology
Mathematical Population Framework
Prediction Interval Definition
Conditional Mean Estimation via k-NN Regression
Conditional Variance Estimation via Residuals
General Variable Selection Strategy for k-NN
Data Splitting Strategy and Hyper-parameter Selection
Model Extensions: Predictive Interval Algorithm
Theory
Simulation Study
Results: Impact of Variable Selection on kNN
...and 15 more sections

Key Result

Theorem 4.1

Assume $\mathbb{E}(Y^{4}) \le L$ for some $L>0$, and that for every $x$ in the support of $\mu$ and every radius $r>0$ we have $\mu\!\bigl(B_r(x)\bigr) > 0$. Then, if $k_1 \to \infty$ with $k_1/n_1 \to 0$ and $k_2 \to \infty$ with $k_2/n_2 \to 0$, the \textit{k}-NN estimators for the mean $m(\cdot)$ and variance $\sigma \ldots$

Figures (4)

Figure 1: We estimate the conditional probability $\mathbb{P}(Y \le y \mid X = x)$ for a fixed value $X = x$, using different variants of the $k$-NN algorithm and GAMLSS. The analysis is based on a specific generative model where $\dim(X) = 10$, but only 4 predictors influence the outcome. The results highlight the importance of appropriate variable selection to accurately approximate the conditional distribution function.
Figure 2: Data-splitting strategy and extensions in our k-NN semi-parametric framework.
Figure 3: Comparison between true values and model predictions over a random sample of 200 participants. Blue crosses denote the true values, while orange circles represent the model’s predictions. The shaded gray area indicates the 95% prediction interval, constructed from the estimated lower and upper bounds for each prediction. This visualization enables an assessment of both the accuracy and the calibration of the model’s predictive uncertainty.
Figure 4: Scatter plots showing the predicted FPG for weight and pulse rate among participants in the AHS dataset.
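
The conditional probability in Figure 1 and the 95% prediction interval in Figure 3 both require turning an estimated mean and variance into a distributional statement. A minimal sketch follows, assuming a Gaussian scale-localization model in which $Y \mid X = x$ is normal with the estimated mean and variance; the paper's own predictive interval algorithm may differ. The function names and the `level` parameter are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def gaussian_prediction_interval(mean_hat, var_hat, level=0.95):
    """Symmetric interval mean +/- z * sd under the assumed Gaussian model."""
    # Two-sided standard normal quantile for the requested coverage level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Clip at zero so a slightly negative variance estimate cannot break sqrt.
    sd = np.sqrt(np.maximum(var_hat, 0.0))
    return mean_hat - z * sd, mean_hat + z * sd


def gaussian_conditional_cdf(y, mean_hat, var_hat):
    """P(Y <= y | X = x) for scalar inputs; requires var_hat > 0."""
    return NormalDist(mu=mean_hat, sigma=var_hat ** 0.5).cdf(y)
```

Under this assumption the interval width is driven entirely by the estimated conditional variance, so miscalibrated variance estimates translate directly into miscalibrated coverage.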

Theorems & Definitions (7)

Definition 3.1: Homoscedastic Scale-Localization Model
Theorem 4.1: Consistency of the Mean and Variance Regression Functions
Remark 4.2
Theorem 4.3: Rates of the k-NN Scale-Localization Gaussian Model
Remark 4.4
Theorem 4.5: Universal Consistency of the k-NN Variable-Selection Rule
Remark 4.6
