Doubly robust nearest neighbors in factor models
Raaz Dwivedi, Katherine Tian, Sabina Tomkins, Predrag Klasnja, Susan Murphy, Devavrat Shah
TL;DR
This work addresses matrix completion under latent-factor models with missing entries by introducing a doubly robust nearest-neighbors (DR-NN) estimator that fuses unit-NN and time-NN ideas. DR-NN achieves consistency when either unit or time neighbors are informative and attains near-quadratic improvements in non-asymptotic error when both are informative, with substantially narrower asymptotic confidence intervals. The analysis provides non-asymptotic and asymptotic guarantees under both bilinear and general non-linear factor models, with data-splitting to enable clean bias-variance decomposition. The approach extends to tensor denoising and has connections to bias-correction and orthogonal loss frameworks, offering a principled, interpretable, and robust alternative to vanilla NN methods in missing-data settings.
Abstract
We introduce and analyze an improved variant of nearest neighbors (NN) for estimation with missing data in latent factor models. We consider a matrix completion problem with missing data, where the $(i, t)$-th entry, when observed, is given by its mean $f(u_i, v_t)$ plus mean-zero noise for an unknown function $f$ and latent factors $u_i$ and $v_t$. Prior NN strategies, like unit-unit NN, for estimating the mean $f(u_i, v_t)$ relies on existence of other rows $j$ with $u_j \approx u_i$. Similarly, time-time NN strategy relies on existence of columns $t'$ with $v_{t'} \approx v_t$. These strategies provide poor performance respectively when similar rows or similar columns are not available. Our estimate is doubly robust to this deficit in two ways: (1) As long as there exist either good row or good column neighbors, our estimate provides a consistent estimate. (2) Furthermore, if both good row and good column neighbors exist, it provides a (near-)quadratic improvement in the non-asymptotic error and admits a significantly narrower asymptotic confidence interval when compared to both unit-unit or time-time NN.
