Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems
Derek Lilienthal, Paul Mello, Magdalini Eirinaki, Stas Tiomkin
TL;DR
This work addresses the privacy and data-sparsity challenges of training recommender systems by generating high-quality synthetic datasets. It introduces SDRM (Score-based Diffusion Recommendation Module), a two-stage approach that maps user-item interactions into a latent Gaussian space via a pretrained MultiVAE, applies a score-based diffusion model to denoise and sample new latent vectors, and decodes them back to the original interaction space. The method outperforms baselines both when synthetic data augments the real data and when it replaces it, with average gains of 4.30% in Recall@k and 4.65% in NDCG@k, while preserving privacy (≈99% dissimilarity to the original data). Combining a diffusion process with variational inference leverages the strengths of both paradigms to capture intricate user preferences, offering a practical route to privacy-preserving, data-efficient recommender systems. The work positions diffusion-based synthetic data generation as a viable alternative to traditional privacy techniques, with implications for industry deployments where regulations constrain data sharing.
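The data flow described above (encode, diffuse and denoise in latent space, decode) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the linear `encode`/`decode` functions are hypothetical stand-ins for a pretrained MultiVAE, the closed-form `score` assumes standard-normal latents in place of a trained score network, and the constant noise rate `beta`, the step count, and the top-k decoding rule are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, latent_dim = 8, 20, 4

# Toy binary user-item interaction matrix (1 = interaction observed).
X = (rng.random((n_users, n_items)) < 0.2).astype(np.float64)

# Hypothetical linear stand-ins for a pretrained MultiVAE encoder/decoder.
W_enc = rng.normal(size=(n_items, latent_dim)) / np.sqrt(n_items)
W_dec = rng.normal(size=(latent_dim, n_items)) / np.sqrt(latent_dim)

def encode(x):
    """Map interaction rows to latent vectors (stage 1)."""
    return x @ W_enc

def decode(z):
    """Map latent vectors back to per-item logits."""
    return z @ W_dec

def forward_noise(z0, t, beta=1.0):
    """Variance-preserving forward process at time t in [0, 1]."""
    a_bar = np.exp(-beta * t)
    eps = rng.normal(size=z0.shape)
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

def score(z, t):
    # Assumption: latents are standard normal, so every noised marginal is
    # N(0, I) and its score is -z. A trained network would replace this.
    return -z

def reverse_diffusion(z, n_steps=100, beta=1.0):
    """Euler-Maruyama integration of the reverse-time SDE from t=1 to t=0."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        drift = -0.5 * beta * z - beta * score(z, t)
        z = z - drift * dt + np.sqrt(beta * dt) * rng.normal(size=z.shape)
    return z

def synthesize(n_samples, top_k=3):
    """Sample latents from noise, denoise, decode, keep top_k items per row."""
    z = reverse_diffusion(rng.normal(size=(n_samples, latent_dim)))
    top = np.argsort(-decode(z), axis=1)[:, :top_k]
    out = np.zeros((n_samples, n_items))
    np.put_along_axis(out, top, 1.0, axis=1)
    return out

Z_real = encode(X)                     # latents a score model would train on
Z_noisy = forward_noise(Z_real, 0.5)   # example noised training input
X_syn = synthesize(5)                  # synthetic interaction rows
```

In this sketch, training is skipped entirely; in a real pipeline the score function would be fitted on noised versions of `Z_real` before any sampling takes place.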
Abstract
While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work, we introduce a Score-based Diffusion Recommendation Module (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data, with an average improvement of 4.30% in Recall@k and 4.65% in NDCG@k.
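For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Recall@k and NDCG@k for a single user with binary relevance. The function names and the example item IDs are hypothetical, and the exact normalization is an assumption: some evaluation protocols divide Recall@k by min(k, number of held-out items) rather than by the number of held-out items as done here.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant items that appear in the top-k ranking."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    # Position 0 is rank 1, so the discount is 1 / log2(rank + 1).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking places all relevant items (up to k) at the top.
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / idcg

ranked_items = ["a", "b", "c", "d"]   # model's ranking, best first
held_out = {"a", "d", "e"}            # items the user actually interacted with

recall = recall_at_k(ranked_items, held_out, k=3)
ndcg = ndcg_at_k(ranked_items, held_out, k=3)
```

Both metrics are typically averaged over all test users; NDCG@k additionally rewards placing relevant items nearer the top of the list.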
