N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion

Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi

TL;DR

This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface, and uses it to develop a new NN variant that achieves state-of-the-art results in several settings.

Abstract

Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
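To make the idea behind NN-based matrix completion concrete, here is a minimal sketch of a generic row-wise nearest neighbor estimator written with plain NumPy. It is not the N$^2$ package interface and not the new variant mentioned above; the function name `row_nn_impute`, the threshold parameter `eta`, and the mean-squared-difference distance are assumptions chosen for illustration.

```python
import numpy as np


def row_nn_impute(X, mask, i, j, eta):
    """Estimate entry (i, j) by averaging column j over rows close to row i.

    X    : 2-D float array; values where mask is False are never read
    mask : 2-D bool array, True where an entry is observed
    eta  : distance threshold (hypothetical tuning parameter)
    """
    neighbor_values = []
    for u in range(X.shape[0]):
        if not mask[u, j]:
            continue  # row u has no observation in column j, so it cannot contribute
        # Columns observed in both rows, excluding the target column j.
        overlap = mask[i] & mask[u]
        overlap[j] = False
        if not overlap.any():
            continue  # no shared columns, so the distance is undefined
        # Mean squared difference over the shared columns.
        dist = np.mean((X[i, overlap] - X[u, overlap]) ** 2)
        if dist <= eta:
            neighbor_values.append(X[u, j])
    if not neighbor_values:
        return np.nan  # no row fell within the threshold
    return float(np.mean(neighbor_values))


# Toy example: rows 0 and 1 agree on the observed columns, row 2 does not.
X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [5.0, 6.0, 7.0]])
mask = ~np.isnan(X)
estimate = row_nn_impute(X, mask, i=0, j=2, eta=0.5)
```

In this toy case only row 1 lies within the threshold of row 0, so the estimate for the missing entry is taken from row 1 alone. A larger `eta` admits more neighbors, which lowers variance at the cost of averaging over less similar rows.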

Paper Structure

This paper contains 41 sections, 3 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Error scaling for certain NN variants in synthetic experiments. See \ref{app:synthetic_gen} for details on the data-generating process and how the signal-to-noise ratio (SNR) is defined. Each point corresponds to the mean absolute error ± 1 standard error across 30 trials.
  • Figure 2: HeartSteps: estimating step count under scalar and distributional matrix completion settings. Panel (a) shows the absolute error of predicted step count of the nearest neighbor methods against matrix completion baselines (SoftImpute, USVT). Panel (b) shows an example of an imputed entry in the distributional matrix completion setting.
  • Figure 3: MovieLens: Estimation error for a random subsample of size 500. For experimental settings and discussion, see \ref{sec:movielens}.
  • Figure 4: Nearest neighbor methods generate high-fidelity synthetic controls in counterfactual inference for panel data. For exact settings and further discussion, see \ref{sub:prop99}.
  • Figure 5: Distributional nearest neighbor methods enable efficient LLM evaluation on MMLU. We estimate LLM score distributions across all models and tasks given only a limited number of model-task evaluations, determined by the propensity $p$. See \ref{sub:llm} for a detailed discussion.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Remark 1