Doubly robust nearest neighbors in factor models

Raaz Dwivedi; Katherine Tian; Sabina Tomkins; Predrag Klasnja; Susan Murphy; Devavrat Shah

TL;DR

This work addresses matrix completion under latent-factor models with missing entries by introducing a doubly robust nearest-neighbors (DR-NN) estimator that fuses unit-NN and time-NN ideas. DR-NN is consistent when either unit or time neighbors are informative. When both are informative, it attains a near-quadratic improvement in non-asymptotic error and substantially narrower asymptotic confidence intervals. The analysis provides non-asymptotic and asymptotic guarantees under both bilinear and general non-linear factor models, using data splitting to enable a clean bias-variance decomposition. The approach extends to tensor denoising and has connections to bias-correction and orthogonal loss frameworks, offering a principled, interpretable, and robust alternative to vanilla NN methods in missing-data settings.
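A short worked identity may help explain where the near-quadratic improvement comes from. It is a sketch under two assumptions that are not stated on this page: the noise-free bilinear model $f(u, v) = \langle u, v \rangle$, and a DR-NN building block that combines a unit neighbor $j$ and a time neighbor $t'$ as $f(u_j, v_t) + f(u_i, v_{t'}) - f(u_j, v_{t'})$. Under those assumptions the approximation error factorizes:

$$
\begin{aligned}
f(u_i, v_t) - \big[ f(u_j, v_t) + f(u_i, v_{t'}) - f(u_j, v_{t'}) \big]
&= \langle u_i, v_t \rangle - \langle u_j, v_t \rangle - \langle u_i, v_{t'} \rangle + \langle u_j, v_{t'} \rangle \\
&= \langle u_i - u_j,\; v_t - v_{t'} \rangle .
\end{aligned}
$$

By Cauchy-Schwarz, the error is at most $\|u_i - u_j\| \, \|v_t - v_{t'}\|$. It vanishes when either neighbor is exact, and it is a product of two small terms when both neighbors are merely close, whereas unit-NN or time-NN alone would incur an error proportional to a single such term.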

Abstract

We introduce and analyze an improved variant of nearest neighbors (NN) for estimation with missing data in latent factor models. We consider a matrix completion problem with missing data, where the $(i, t)$-th entry, when observed, is given by its mean $f(u_i, v_t)$ plus mean-zero noise for an unknown function $f$ and latent factors $u_i$ and $v_t$. Prior NN strategies, like unit-unit NN, for estimating the mean $f(u_i, v_t)$ rely on the existence of other rows $j$ with $u_j \approx u_i$. Similarly, the time-time NN strategy relies on the existence of columns $t'$ with $v_{t'} \approx v_t$. These strategies perform poorly when similar rows or similar columns, respectively, are not available. Our estimate is doubly robust to this deficit in two ways: (1) As long as there exist either good row or good column neighbors, our estimate is consistent. (2) Furthermore, if both good row and good column neighbors exist, it provides a (near-)quadratic improvement in the non-asymptotic error and admits a significantly narrower asymptotic confidence interval when compared to either unit-unit or time-time NN.
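The idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions introduced here for illustration: the function names `dr_nn` and `_dist_to_row`, the thresholds `eta_row` and `eta_col`, the choice of distance (mean squared difference over commonly observed entries), and the combination `X[j, t] + X[i, t'] - X[j, t']` averaged over neighbor pairs whose three entries are all observed. The sketch also omits the sample split used in the paper's analysis.

```python
import numpy as np

def _dist_to_row(X0, A, i, skip_col):
    """Mean squared gap between row i and every row, over columns observed in both.

    Column `skip_col` is left out so the distance never uses the target entry.
    Rows with no overlap get distance +inf.
    """
    both = (A & A[i]).astype(float)
    both[:, skip_col] = 0.0
    counts = both.sum(axis=1)
    sq = ((X0 - X0[i]) ** 2 * both).sum(axis=1)
    dist = np.full(X0.shape[0], np.inf)
    ok = counts > 0
    dist[ok] = sq[ok] / counts[ok]
    return dist

def dr_nn(X, A, i, t, eta_row, eta_col):
    """Hypothetical doubly robust NN estimate of the mean of entry (i, t).

    X: (n, T) data matrix (values at unobserved entries are ignored).
    A: (n, T) observation indicators (nonzero = observed).
    eta_row, eta_col: assumed distance thresholds for row / column neighbors.
    Returns np.nan when no usable neighbor pair exists.
    """
    A = np.asarray(A) != 0
    X0 = np.where(A, np.asarray(X, dtype=float), 0.0)

    # Row (unit) neighbors of i and column (time) neighbors of t.
    rows = np.flatnonzero(_dist_to_row(X0, A, i, skip_col=t) <= eta_row)
    rows = rows[rows != i]
    cols = np.flatnonzero(_dist_to_row(X0.T, A.T, t, skip_col=i) <= eta_col)
    cols = cols[cols != t]

    # All (j, t') pairs; keep those where the three needed entries are observed.
    J, Tp = np.meshgrid(rows, cols, indexing="ij")
    valid = A[J, t] & A[i, Tp] & A[J, Tp]
    if not valid.any():
        return np.nan
    vals = X0[J, t] + X0[i, Tp] - X0[J, Tp]
    return vals[valid].mean()

# Usage: noise-free additive model f(u, v) = u + v with entry (0, 0) missing.
rng = np.random.default_rng(0)
u, v = rng.normal(size=6), rng.normal(size=5)
X = u[:, None] + v[None, :]
A = np.ones((6, 5), dtype=int)
A[0, 0] = 0
X_obs = X.copy()
X_obs[0, 0] = np.nan
est = dr_nn(X_obs, A, 0, 0, eta_row=1e6, eta_col=1e6)
```

In this additive, noise-free usage example the combination cancels exactly for every neighbor pair, so `est` recovers `u[0] + v[0]` up to floating-point error even with very loose thresholds. With noisy data, the thresholds trade bias (loose neighbors) against variance (few neighbors).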

Paper Structure (32 sections, 5 theorems, 56 equations)

Introduction
Our contributions
Organization
Problem set-up and algorithm
Data generating mechanism
Algorithm
An intuitive construction of DR-NN
Unit-NN
Time-NN
Steps towards an improved nearest neighbors estimate
"Incorrect" guesses
The "correct" guess
Vanilla-NN and DR-NN for missing and noisy data
Unit-NN
Time-NN
...and 17 more sections

Key Result

Theorem 3.1

Let Assumptions assum:mcar, assum:noise, and assum:non_linear_f be in force, and consider a tuple $(i, t)$. Given a fixed $\delta \in (0, 1)$, suppose that the hyperparameter $\boldsymbol{\eta}=(\eta_1,\eta_2)$ satisfies the regularity condition eq:eta_cond_f. Then there exist universal constants $c, c'$ such that cond

Theorems & Definitions (12)

Remark 1: Sample split for theoretical analysis
Remark 2: Estimates when there are no neighbors
Example: Discrete unit factors
Example: Continuous unit factors
Theorem 3.1: Non-asymptotic guarantee for $\widehat{\theta}_{{i},{t}, \boldsymbol{\eta}}^{\textrm{DR}}$ with non-linear factor model
Corollary 1: Non-asymptotic guarantee for $\widehat{\theta}_{{i},{t}, \boldsymbol{\eta}}^{\textrm{DR}}$ with bilinear factor model
Corollary 2: Error rates for $\widehat{\theta}_{{i},{t}, \boldsymbol{\eta}}^{\textrm{DR}}$ for the discrete and continuous unit factor examples
Remark 3: No neighbors yields vacuous guarantee
Corollary 3: Error rates for $\widehat{\theta}_{{i},{t}, \boldsymbol{\eta}}^{\textrm{DR}}$ with non-linear factor models
Theorem 3.2: Asymptotic error guarantee for $\widehat{\theta}_{{i},{t}, \boldsymbol{\eta}}^{\textrm{DR}}$ with non-linear factor models
...and 2 more
