S-Diff: An Anisotropic Diffusion Model for Collaborative Filtering in Spectral Domain
Rui Xia, Yanhua Cheng, Yongxiang Tang, Xiaocheng Liu, Xialong Liu, Lisong Wang, Peng Jiang
TL;DR
This work tackles the challenge that diffusion-based collaborative filtering models struggle to exploit cross-user shared preferences and suffer SNR loss in the forward process. It introduces S-Diff, an anisotropic diffusion model defined in the graph spectral domain, where diffusion noise is aligned with the eigenvalues of the item-item Laplacian to preserve low-frequency components that encode global user preferences. A FiLM-based conditional denoiser is trained to recover true user preferences from spectral-domain noise, and classifier-free guidance is employed to balance conditioning with diversity. Empirically, S-Diff achieves state-of-the-art recall and ranking metrics across MovieLens-1M, Yelp, and Amazon-Book, while maintaining stability by leveraging spectral information and a bounded noise schedule. The approach highlights the practical value of graph-spectral diffusion for scalable, structure-aware collaborative filtering with potential for broader graph-based recommender systems.
Abstract
Recovering user preferences from user-item interaction matrices is a key challenge in recommender systems. While diffusion models can sample and reconstruct preferences from latent distributions, they often fail to capture the collective preferences of similar users effectively. Additionally, latent variables degrade into pure Gaussian noise during the forward process, which lowers the signal-to-noise ratio and in turn degrades performance. To address these issues, we propose S-Diff, which is inspired by graph-based collaborative filtering and designed to better utilize low-frequency components in the graph spectral domain. S-Diff maps user interaction vectors into the spectral domain and parameterizes the diffusion noise to align with graph frequency. This anisotropic diffusion retains significant low-frequency components, preserving a high signal-to-noise ratio. S-Diff further employs a conditional denoising network to encode user interactions, recovering true preferences from noisy data. This method achieves strong results across multiple datasets.
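To make the spectral-domain forward process concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds a normalized item-item Laplacian from a toy interaction matrix, projects a user's interaction vector onto its eigenvectors, and applies frequency-dependent (anisotropic) attenuation plus bounded Gaussian noise. The function names, the exponential attenuation `exp(-t * lambda)`, the linear noise scale `sigma_max * t`, and the particular Laplacian normalization are all assumptions chosen for illustration; the abstract does not specify the actual schedule.

```python
import numpy as np

def item_laplacian(R):
    """Normalized item-item Laplacian I - A^T A (an assumed construction)."""
    R = np.asarray(R, dtype=float)
    du = np.maximum(R.sum(axis=1, keepdims=True), 1.0)  # user degrees
    di = np.maximum(R.sum(axis=0, keepdims=True), 1.0)  # item degrees
    A = R / np.sqrt(du) / np.sqrt(di)                   # D_U^{-1/2} R D_I^{-1/2}
    return np.eye(R.shape[1]) - A.T @ A

def spectral_basis(L):
    """Eigenvalues (ascending, i.e. low to high frequency) and eigenvectors."""
    lam, U = np.linalg.eigh(L)
    return lam, U

def forward_noise(x, U, lam, t, sigma_max=0.5, rng=None):
    """Anisotropic forward step in the spectral domain for t in [0, 1].

    Hypothetical schedule: each frequency is attenuated by exp(-t * lambda),
    so low-frequency components (small lambda) keep most of their signal,
    and the noise scale is bounded by sigma_max.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_hat = U.T @ x                                     # graph Fourier transform
    span = max(lam.max() - lam.min(), 1e-12)
    lam01 = (lam - lam.min()) / span                    # rescale to [0, 1]
    alpha = np.exp(-t * lam01)                          # per-frequency retention
    sigma = sigma_max * t                               # bounded noise level
    z = alpha * x_hat + sigma * rng.standard_normal(x_hat.shape)
    return z, alpha, sigma

# Toy data: 4 users x 5 items, binary interactions.
R = np.array([[1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1]])
lam, U = spectral_basis(item_laplacian(R))
z, alpha, sigma = forward_noise(R[0].astype(float), U, lam, t=1.0,
                                rng=np.random.default_rng(0))
snr = alpha / sigma   # per-frequency ratio of signal retention to noise scale
```

Under this assumed schedule the per-frequency signal-to-noise ratio `alpha / sigma` is largest for the smallest eigenvalues, which mirrors the abstract's claim that anisotropic diffusion preserves the low-frequency components. An isotropic process would instead apply the same retention factor to every frequency.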
