Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information

Antoine Ledent; Mun Chong Soo; Nong Minh Hieu

TL;DR

The paper tackles semi-supervised matrix completion where both the ground-truth matrix $\mathrm{G}$ and the sampling distribution $P$ are low-rank and share a common subspace. It introduces DAMC, a method that first uses unlabeled samples to recover a shared low-rank subspace from the distribution over observed entries, then applies Inductive Matrix Completion (IMC) with a nuclear-norm constraint to recover $\mathrm{G}$ from labeled data. The main theoretical contribution is a generalization bound that decomposes into two additive terms: $\widetilde{O}\left(\sqrt{\tfrac{(m+n)r}{M}}\right)$ for subspace estimation from unlabeled data and $\widetilde{O}\left(\sqrt{\tfrac{dr}{N}}\right)$ for ground-truth recovery from labeled data, plus a higher-order term that vanishes under mild conditions on the sampling distribution. Empirically, the authors validate the additive decomposition on synthetic data and demonstrate that leveraging unlabeled data improves explicit-feedback prediction on real recommender-system datasets (Douban, MovieLens, Yelp), often outperforming baselines relying only on explicit ratings. This work formalizes a toy-but-principled linkage between implicit and explicit feedback through shared subspaces and provides practical guidance for leveraging abundant unlabeled interactions in recommendation tasks.

Abstract

We study a matrix completion problem where both the ground truth matrix $R$ and the unknown sampling distribution $P$ over observed entries are low-rank matrices, and \textit{share a common subspace}. We assume that a large amount $M$ of \textit{unlabeled} data drawn from the sampling distribution $P$ is available, together with a small amount $N$ of labeled data, consisting of entries drawn from the same distribution along with noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to `implicit feedback' (consisting of interactions such as purchases, clicks, etc.) and the labeled data corresponds to `explicit feedback', consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as $\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)$ and $\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)$ respectively, where $d$ is the rank of $P$ and $r$ is the rank of $R$. In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimation of $P$ and of the ground truth matrix $R$ respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, which suggests that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedback in recommender systems.
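
The two-stage idea described above (recover the shared subspace from unlabeled samples, then complete the matrix inside that subspace from labeled samples) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy data generator and all numeric settings are hypothetical, and the nuclear-norm constraint is replaced by a small ridge penalty on the $d \times d$ core matrix so that the second stage has a closed-form solution.

```python
import numpy as np


def estimate_subspaces(pairs, m, n, d):
    """Stage 1: top-d singular vectors of the empirical sampling matrix."""
    counts = np.zeros((m, n))
    # np.add.at accumulates correctly when the same (row, col) pair repeats.
    np.add.at(counts, (pairs[:, 0], pairs[:, 1]), 1.0)
    p_hat = counts / len(pairs)
    u, _, vt = np.linalg.svd(p_hat, full_matrices=False)
    return u[:, :d], vt[:d].T


def fit_core(u, v, pairs, y, lam=1e-6):
    """Stage 2: ridge least squares for C in the model R[i, j] ~ u[i] @ C @ v[j].

    Assumption: ridge is a stand-in for the nuclear-norm constraint.
    """
    d = u.shape[1]
    a, b = u[pairs[:, 0]], v[pairs[:, 1]]
    # Row k of phi is kron(a[k], b[k]), so phi @ C.ravel() gives the predictions.
    phi = (a[:, :, None] * b[:, None, :]).reshape(len(y), d * d)
    gram = phi.T @ phi / len(y) + lam * np.eye(d * d)
    c = np.linalg.solve(gram, phi.T @ y / len(y))
    return c.reshape(d, d)


def two_stage_completion(unlabeled, labeled, y, m, n, d):
    """Estimate the subspaces, fit the core matrix, return the completed matrix."""
    u, v = estimate_subspaces(unlabeled, m, n, d)
    return u @ fit_core(u, v, labeled, y) @ v.T


def toy_problem(m=20, n=20, num_unlabeled=200_000, num_labeled=2_000, seed=0):
    """Hypothetical rank-2 example in which P and R share row/column spaces."""
    rng = np.random.default_rng(seed)
    # Two user groups and two item groups; all factor entries are positive,
    # so P = w @ h.T (normalized) is a valid distribution with full support.
    w = np.where(np.arange(m)[:, None] < m // 2, [1.0, 0.2], [0.2, 1.0])
    h = np.where(np.arange(n)[:, None] < n // 2, [1.0, 0.2], [0.2, 1.0])
    p = w @ h.T
    p /= p.sum()
    r = w @ np.array([[2.0, -1.0], [0.5, 1.0]]) @ h.T

    def sample(size):
        flat = rng.choice(m * n, size=size, p=p.ravel())
        return np.column_stack(np.unravel_index(flat, (m, n)))

    unlabeled = sample(num_unlabeled)
    labeled = sample(num_labeled)
    y = r[labeled[:, 0], labeled[:, 1]] + 0.1 * rng.standard_normal(num_labeled)
    return r, unlabeled, labeled, y
```

A typical call would be `r, unlabeled, labeled, y = toy_problem()` followed by `r_hat = two_stage_completion(unlabeled, labeled, y, 20, 20, 2)`. In this sketch the unlabeled sample size only affects the quality of the estimated subspaces, while the labeled sample size only affects the fit of the core matrix, which mirrors the two-term structure of the bound.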
