Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information

Antoine Ledent, Mun Chong Soo, Nong Minh Hieu

TL;DR

The paper tackles semi-supervised matrix completion where both the ground-truth matrix $G$ and the sampling distribution $P$ are low-rank and share a common subspace. It introduces DAMC, a method that first uses unlabeled samples to recover a shared low-rank subspace from the distribution over observed entries, then applies Inductive Matrix Completion (IMC) with a nuclear-norm constraint to recover $G$ from labeled data. The main theoretical contribution is a generalization bound that decomposes into two additive terms: $\widetilde{O}\left(\sqrt{\tfrac{(m+n)r}{M}}\right)$ for subspace estimation from unlabeled data and $\widetilde{O}\left(\sqrt{\tfrac{dr}{N}}\right)$ for ground-truth recovery from labeled data, plus a higher-order term that vanishes under mild conditions on the sampling distribution. Empirically, the authors validate the additive decomposition on synthetic data and demonstrate that leveraging unlabeled data improves explicit-feedback prediction on real recommender-system datasets (Douban, MovieLens, Yelp), often outperforming baselines relying only on explicit ratings. This work formalizes a toy-but-principled linkage between implicit and explicit feedback through shared subspaces and provides practical guidance for leveraging abundant unlabeled interactions in recommendation tasks.
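The two-stage idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`estimate_subspaces`, `fit_core`, `run_demo`), the synthetic data, and all parameter values are hypothetical, and the nuclear-norm constraint of IMC is swapped for a plain ridge penalty on the $d\times d$ core matrix to keep the example short.

```python
import numpy as np

def estimate_subspaces(unlabeled, m, n, d):
    """Stage 1: top-d singular subspaces of the empirical sampling matrix."""
    counts = np.zeros((m, n))
    # np.add.at accumulates correctly when the same (i, j) pair repeats.
    np.add.at(counts, (unlabeled[:, 0], unlabeled[:, 1]), 1.0)
    p_hat = counts / len(unlabeled)
    U, _, Vt = np.linalg.svd(p_hat, full_matrices=False)
    return U[:, :d], Vt[:d].T  # X is (m, d), Y is (n, d)

def fit_core(X, Y, labeled, ratings, ridge=1e-3):
    """Stage 2: ridge least squares for the core of the predictor X @ core @ Y.T.

    Assumption: ridge replaces the nuclear-norm constraint used by IMC.
    """
    i, j = labeled[:, 0], labeled[:, 1]
    d = X.shape[1]
    # Entry (i, j) of X @ core @ Y.T equals <outer(X[i], Y[j]), core>.
    feats = np.einsum("ka,kb->kab", X[i], Y[j]).reshape(len(i), d * d)
    gram = feats.T @ feats + ridge * np.eye(d * d)
    return np.linalg.solve(gram, feats.T @ ratings).reshape(d, d)

def run_demo(seed=0, m=30, n=20, d=2, M=200_000, N=2_000, noise=0.1):
    """Synthetic check: P and G are built from the same factors A and B."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.1, 1.0, size=(m, d))
    B = rng.uniform(0.1, 1.0, size=(n, d))
    A[: m // 2, 0] += 2.0
    A[m // 2 :, 1] += 2.0  # two user clusters
    B[: n // 2, 0] += 2.0
    B[n // 2 :, 1] += 2.0  # two item clusters
    P = A @ B.T
    P /= P.sum()  # nonnegative entries summing to one
    G = A @ np.array([[1.0, 0.3], [-0.2, 0.8]]) @ B.T  # shares P's subspaces

    def sample(size):
        idx = rng.choice(m * n, size=size, p=P.ravel())
        return np.column_stack(np.unravel_index(idx, (m, n)))

    unlabeled, labeled = sample(M), sample(N)
    ratings = G[labeled[:, 0], labeled[:, 1]] + noise * rng.normal(size=N)

    X, Y = estimate_subspaces(unlabeled, m, n, d)
    core = fit_core(X, Y, labeled, ratings)
    # Relative Frobenius error over all entries, observed or not.
    return np.linalg.norm(X @ core @ Y.T - G) / np.linalg.norm(G)

if __name__ == "__main__":
    print("relative error:", run_demo())
```

Varying `M` and `N` in `run_demo` gives a rough feel for how each source of data affects the error in this toy setting. It is no substitute for the paper's experiments.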

Abstract

We study a matrix completion problem where both the ground truth matrix $R$ and the unknown sampling distribution $P$ over observed entries are low-rank matrices that share a common subspace. We assume that a large amount $M$ of unlabeled data drawn from the sampling distribution $P$ is available, together with a small amount $N$ of labeled data drawn from the same distribution and noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to 'implicit feedback' (consisting of interactions such as purchases, clicks, etc.) and the labeled data corresponds to 'explicit feedback', consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as $\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)$ and $\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)$ respectively, where $d$ is the rank of $P$ and $r$ is the rank of $M$. In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimation of $P$ and of the ground truth matrix $R$, respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, showing that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedback in recommender systems.
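As a reading aid, the additive shape of the bound can be written schematically, using the rates exactly as stated in the abstract. Constants, logarithmic factors, and any higher-order terms are suppressed, and the left-hand side is only a shorthand label, not the paper's notation.

$$
\text{generalization error} \;\lesssim\; \underbrace{\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)}_{\text{estimating the subspace from unlabeled data}} \;+\; \underbrace{\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)}_{\text{recovering the ground truth from labeled data}}
$$

Under this square-root scaling, and ignoring logarithmic factors, quadrupling $M$ halves the first term and leaves the second unchanged. Likewise, quadrupling $N$ halves only the second.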

Paper Structure

This paper contains 22 sections, 16 theorems, 76 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumptions assum:loss, assum:lowrankshared, assum:unimarg, assum:wellconditioned, assum:kappa2, and Assum:gamma, with probability greater than $1-\delta$ over the draw of both the implicit and explicit feedback, the following generalization bound holds simultaneously over any predictor $X\underline{M} Y^\top\in\mathbb{R}^{m\times n}$ for $\underline{M}\in\mathbb{R}^{d\times d}$ su

Figures (1)

  • Figure 1: Comparison of generalization error (x-axis) and the corresponding disentangled estimate (y-axis) in the synthetic dataset. Each point in the scatter plot corresponds to one configuration $(M,N)$, with the results averaged over 30 independent runs.

Theorems & Definitions (26)

  • Theorem 1
  • Corollary 1
  • Proposition 1
  • Remark 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 16 more