Flexible Generation of Preference Data for Recommendation Analysis

Simone Mungari, Erica Coppolillo, Ettore Ritacco, Giuseppe Manco

TL;DR

HYDRA addresses the challenge of benchmarking recommender systems with realistic yet controllable synthetic data. It introduces a probabilistic generator that jointly models three interacting factors (User-Item Matching, User Engagement Level, and Item Popularity) through latent factors drawn from Dirichlet priors and mixtures of long-tail distributions, with variational EM-style inference guiding parameter estimation. The main contributions are a flexible community-aware data generation framework, explicit mixture modeling for engagement and popularity, and empirical evidence that the synthetic data preserves real-world distributional properties and benchmarking behavior. This approach enables privacy-preserving, scalable benchmarking across diverse domains while offering tunable realism for controlled experimentation.

Abstract

Simulating a recommendation system in a controlled environment, to identify specific behaviors and user preferences, requires highly flexible synthetic data generation models capable of mimicking the patterns and trends of real datasets. In this context, we propose HYDRA, a novel preference data generation model driven by three main factors: user-item interaction level, item popularity, and user engagement level. The key innovations of the proposed process include the ability to generate user communities characterized by similar item adoptions, reflecting real-world social influences and trends. Additionally, HYDRA considers item popularity and user engagement as mixtures of different probability distributions, allowing for a more realistic simulation of diverse scenarios. This approach enhances the model's capacity to simulate a wide range of real-world cases, capturing the complexity and variability found in actual user behavior. We demonstrate the effectiveness of HYDRA through extensive experiments on well-known benchmark datasets. The results highlight its capability to replicate real-world data patterns, offering valuable insights for developing and testing recommendation systems in a controlled and realistic manner. The code used to perform the experiments is publicly available at https://github.com/SimoneMungari/HYDRA.
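
To make the three factors concrete, the following toy sketch samples a small set of user-item interactions. It is not the HYDRA generative process or its inference procedure: the function names, the Dirichlet concentration of 0.5, the Pareto and log-normal parameters, and the weighted sampling step are all illustrative assumptions, chosen only to show how topical affinity, user engagement, and item popularity could be combined.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_preferences(n_users=50, n_items=80, n_topics=3, seed=0):
    """Toy generator (hypothetical, not HYDRA): returns sorted (user, item) pairs."""
    rng = random.Random(seed)

    # User-item matching: every user and item gets a latent topic vector.
    user_topics = [sample_dirichlet([0.5] * n_topics, rng) for _ in range(n_users)]
    item_topics = [sample_dirichlet([0.5] * n_topics, rng) for _ in range(n_items)]

    # Item popularity: long-tailed weights from a two-component mixture
    # (Pareto or log-normal, picked with equal probability; assumed values).
    popularity = [
        rng.paretovariate(1.5) if rng.random() < 0.5 else rng.lognormvariate(0.0, 1.0)
        for _ in range(n_items)
    ]

    interactions = set()
    for u in range(n_users):
        # User engagement: long-tailed interaction count, at least 1, at most n_items.
        budget = min(n_items, max(1, int(rng.paretovariate(1.2))))

        # Item weight = popularity * topical affinity (small constant keeps it positive).
        weights = [
            popularity[i]
            * (sum(a * b for a, b in zip(user_topics[u], item_topics[i])) + 1e-9)
            for i in range(n_items)
        ]

        # Weighted sampling without replacement (Efraimidis-Spirakis keys):
        # draw key = U ** (1 / w) per item and keep the `budget` largest keys.
        keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
        chosen = [i for _, i in sorted(keys, reverse=True)[:budget]]
        interactions.update((u, i) for i in chosen)

    return sorted(interactions)


interactions = generate_preferences()
```

Raising or lowering the Dirichlet concentration in this sketch makes the topic vectors flatter or more peaked, which is one simple way to loosen or tighten the community structure of the generated matrix.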

Paper Structure

This paper contains 13 sections, 15 equations, 17 figures, 3 tables, 2 algorithms.

Figures (17)

  • Figure 1: Visualization of the user-item interaction matrix for varying values of the $\varepsilon$ parameter. The X-axis reports the users, while the Y-axis represents the items. A dot at position $(u, i)$ indicates that user $u$ interacted with item $i$.
  • Figure 2: Histograms of user interactions with a specific topic of interest. The X-axis represents the percentage of items in $I_1$ within each user's history. The Y-axis shows the proportion of users having that percentage.
  • Figure 3: User/item degree distributions for the partitions obtained with $\varepsilon=0.01$. The first graph shows the global degree distributions. Graphs 2 and 3 focus on the user communities $U_1$ and $U_2$, whereas graphs 4 and 5 focus on the item categories $I_1$ and $I_2$.
  • Figure 4: Degree distributions with the following priors: (a) Power-Law with exponential cut-off for users and items; (b) Power-Law with exponential cut-off for users and Power-Law for items; (c) Stretched Exponential for users and Power-Law for items; (d) Log-Normal for users and items.
  • Figure 5: Effects of the $\zeta$, $\xi$, and $\lambda$ hyper-parameters on the distributions of the generated data.
  • ...and 12 more figures
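
The user and item degree distributions shown in several of these figures can be computed directly from a list of interactions. The sketch below assumes the data is available as `(user, item)` pairs; the function name `degree_distributions` and the toy input are hypothetical and do not come from the paper's code.

```python
from collections import Counter


def degree_distributions(interactions):
    """Return (user_hist, item_hist), each mapping a degree to the number of
    users (or items) that have exactly that many distinct interactions."""
    pairs = set(interactions)  # count each (user, item) pair once
    user_degree = Counter(u for u, _ in pairs)
    item_degree = Counter(i for _, i in pairs)
    user_hist = dict(Counter(user_degree.values()))
    item_hist = dict(Counter(item_degree.values()))
    return user_hist, item_hist


# Toy input: u3 interacts with item "a" twice, which counts as one interaction.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "a"), ("u3", "a")]
user_hist, item_hist = degree_distributions(toy)
```

Plotting these histograms on log-log axes is the usual way to inspect whether the tails look like the long-tail priors named in Figure 4.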