Flexible Generation of Preference Data for Recommendation Analysis
Simone Mungari, Erica Coppolillo, Ettore Ritacco, Giuseppe Manco
TL;DR
HYDRA tackles the challenge of benchmarking recommender systems with realistic yet controllable synthetic data. It introduces a probabilistic generator that jointly models three interacting factors (User-Item Matching, User Engagement Level, and Item Popularity) through latent factors drawn from Dirichlet priors and mixtures of long-tail distributions, with variational EM-style inference guiding parameter estimation. The main contributions include a flexible community-aware data generation framework, explicit mixture modeling for engagement and popularity, and empirical evidence that synthetic data preserves real-world distributional properties and benchmarking behavior. This approach enables privacy-preserving, scalable benchmarking across diverse domains while offering tunable realism for controlled experimentation.
Abstract
Simulating a recommendation system in a controlled environment to identify specific behaviors and user preferences requires highly flexible synthetic data generation models capable of mimicking the patterns and trends of real datasets. In this context, we propose HYDRA, a novel preference data generation model driven by three main factors: user-item interaction level, item popularity, and user engagement level. A key innovation of the proposed process is the ability to generate user communities characterized by similar item adoptions, reflecting real-world social influences and trends. Additionally, HYDRA models item popularity and user engagement as mixtures of different probability distributions, allowing for a more realistic simulation of diverse scenarios. This approach enhances the model's capacity to simulate a wide range of real-world cases, capturing the complexity and variability found in actual user behavior. We demonstrate the effectiveness of HYDRA through extensive experiments on well-known benchmark datasets. The results highlight its capability to replicate real-world data patterns, offering valuable insights for developing and testing recommendation systems in a controlled and realistic manner. The code used to perform the experiments is publicly available at https://github.com/SimoneMungari/HYDRA.
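
To make the three-factor idea concrete, the following Python sketch generates a binary user-item matrix from community-based matching, item popularity, and user engagement. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of Pareto components for the long-tail mixtures, the way the factors are combined, and every hyperparameter value are hypothetical, and the sketch performs no parameter inference.

```python
import numpy as np


def sample_long_tail_mixture(rng, n, weights, exponents):
    """Draw n values >= 1 from a mixture of Pareto components (assumed form)."""
    components = rng.choice(len(weights), size=n, p=weights)
    shape = np.asarray(exponents, dtype=float)[components]
    # Generator.pareto samples a Lomax distribution; adding 1 gives a
    # classical Pareto with minimum value 1.
    return 1.0 + rng.pareto(shape)


def generate_preferences(n_users=200, n_items=100, n_communities=4, seed=0):
    """Return a binary (n_users, n_items) interaction matrix."""
    rng = np.random.default_rng(seed)

    # User-item matching: users mix over communities, and each community
    # has its own item affinities. Both are drawn from Dirichlet priors.
    membership = rng.dirichlet(np.full(n_communities, 0.3), size=n_users)
    affinity = rng.dirichlet(np.full(n_items, 0.5), size=n_communities)
    matching = membership @ affinity + 1e-9  # keep every entry positive

    # Item popularity and user engagement: mixtures of long-tail components
    # (weights and exponents are illustrative values only).
    popularity = sample_long_tail_mixture(rng, n_items, [0.7, 0.3], [1.5, 3.0])
    engagement = sample_long_tail_mixture(rng, n_users, [0.6, 0.4], [1.2, 2.5])

    # Assumed combination rule: popularity reweights the matching scores,
    # and engagement sets how many items each user adopts.
    scores = matching * popularity
    probs = scores / scores.sum(axis=1, keepdims=True)
    max_budget = max(1, n_items // 2)
    budget = np.clip(np.round(5 * engagement), 1, max_budget).astype(int)

    interactions = np.zeros((n_users, n_items), dtype=np.int8)
    for u in range(n_users):
        chosen = rng.choice(n_items, size=budget[u], replace=False, p=probs[u])
        interactions[u, chosen] = 1
    return interactions


if __name__ == "__main__":
    matrix = generate_preferences()
    print("interactions per user (first 5):", matrix.sum(axis=1)[:5])
    print("interactions per item (first 5):", matrix.sum(axis=0)[:5])
```

Lowering the Dirichlet concentration for `membership` makes the communities sharper, while changing the mixture weights and exponents alters how heavy the popularity and engagement tails are. These are the kinds of knobs the abstract refers to, although the actual parameterization in HYDRA may differ.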
