Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems

Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

TL;DR

This work presents a modular, deterministic framework for generating completely synthetic, production-scale, high-cardinality categorical datasets tailored to evaluating real-life recommender systems, addressing privacy and data-access limitations. The core tool, CategoricalClassification, provides controls for feature generation, target formation, correlations, and data augmentation, and is available as an open-source Python package (catclass) with integration in Outrank for AutoML-style feature selection. Through three use cases (benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches), the framework demonstrates its ability to isolate model behavior, expose memory-accuracy trade-offs, and reveal biases under complex feature interactions. The approach enables reproducible, scenario-driven experiments, supporting systematic evaluation and development of recommender-system pipelines, with future directions including GAN/VAE-based enrichment and extensions to regression tasks.

Abstract

Synthetic datasets are important for evaluating and testing machine learning models. Evaluating real-life recommender systems often involves high-dimensional, sparse categorical datasets. Unfortunately, few solutions allow the generation of artificial datasets with such characteristics. For that purpose, we developed a novel framework for generating synthetic datasets that are diverse and statistically coherent. Our framework allows the creation of datasets with controlled attributes, enabling iterative modifications to fit specific experimental needs, such as introducing complex feature interactions, feature cardinality, or specific distributions. We demonstrate the framework's utility through use cases such as benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches. Unlike existing methods that either focus narrowly on specific dataset structures or prioritize (private) data synthesis from real data, our approach provides a modular means to quickly generate completely synthetic datasets that can be tailored to diverse experimental requirements. Our results show that the framework effectively isolates model behavior in unique situations and highlight its potential for significant advancements in the evaluation and development of recommender systems. The framework is readily available as a free, open-source Python package to facilitate research with minimal friction.
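To make the idea of controlled, fully synthetic categorical data concrete, the following is a minimal NumPy sketch. It does not use the catclass API; the function name `make_categorical_dataset`, its parameters, the Zipf-like weights, and the parity-based target rule are all hypothetical choices made for illustration only.

```python
import numpy as np


def make_categorical_dataset(n_samples=10_000, cardinalities=(1000, 50, 5), seed=42):
    """Hypothetical helper (not part of catclass): draw integer-coded
    categorical features with long-tail frequencies and build a binary
    target from an interaction between the first two features."""
    rng = np.random.default_rng(seed)  # a fixed seed makes the output deterministic
    columns = []
    for card in cardinalities:
        # Zipf-like weights: category k gets probability proportional to 1 / (k + 1).
        ranks = np.arange(1, card + 1)
        probs = 1.0 / ranks
        probs /= probs.sum()
        columns.append(rng.choice(card, size=n_samples, p=probs))
    X = np.column_stack(columns)

    # Assumed target rule: XOR of the parities of features 0 and 1.
    y = ((X[:, 0] % 2) ^ (X[:, 1] % 2)).astype(int)
    return X, y


X, y = make_categorical_dataset()
print(X.shape, y.shape)
```

Calling the function twice with the same seed yields identical arrays, which is the property that makes scenario-driven experiments repeatable. Changing `cardinalities` or the weights alters the feature structure without touching the target rule.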

Paper Structure

This paper contains 7 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: PCA plot (left) and feature densities (right) of a synthetic dataset with 9 features and 10,000 samples generated with our framework. To demonstrate its capabilities, we generated two target vectors, one via clustering (color) and one using a custom-defined decision function (size), both shown in the PCA plot. Using the structure parameter, we created features with differing distributions, including the commonly seen long-tail, bimodal, and normal distributions.
  • Figure 2: Distributions of computation times over more than 2,000 synthetic datasets, each comprising 20 features and 1 million rows. All algorithms keep the error rate below 0.005 (the set is exact). A small-enough hllc (HyperLogLog with caching) achieves minimal error with run times similar to those of the set itself, because it remains deterministic for most of the datasets.
  • Figure 3: AUC (left) and accuracy (right) scores of DeepFM and logistic regression after one epoch. Synthetic dataset configs are sets of pairwise relevant feature combinations: 1: AND; 2: OR; 3: XOR; 4: AND, OR; 5: AND, OR, XOR; 6: sum of squares; 7: square of sums; 8: both square combinations; 9: AND, OR, XOR, sum of squares; 10: AND, OR, XOR, square of sums; 11: all feature combinations present.
  • Figure 4: Negative Log Loss for AutoML evolution with different dataset sizes. Evolution set 1: {OR3, OR4, AND2, XOR0, AND3, IRR50, IRR75, IRR27, AND4, IRR15, SUM_SQUARES2...} Evolution set 2: {OR1, AND4, OR4, OR3, AND2, OR0, AND0, AND5, OR2, OR5, XOR5, IRR69, AND1, AND3...} Evolution set 3: {OR5, OR1, AND0, AND5, OR3, OR2, AND4, AND3, OR4, OR0, AND1, AND2, IRR3, IRR41...}
  • Figure 5: Accuracy scores and AUC of DeepFM and logistic regression for features from AutoML and different datasets.
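
The pairwise feature combinations named in the Figure 3 caption (AND, OR, XOR, sum of squares, square of sums) can be illustrated with a small NumPy sketch. This is an assumption about how such combinations might be computed for binary-coded features, not the framework's actual implementation; `pairwise_combinations` and the dictionary keys are hypothetical names.

```python
import numpy as np


def pairwise_combinations(a, b):
    """Hypothetical helper: combine two binary (0/1) integer feature
    columns using the operations listed in the Figure 3 caption."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    return {
        "AND": a & b,                    # 1 only where both features are 1
        "OR": a | b,                     # 1 where at least one feature is 1
        "XOR": a ^ b,                    # 1 where exactly one feature is 1
        "SUM_SQUARES": a**2 + b**2,      # sum of squares
        "SQUARE_SUMS": (a + b) ** 2,     # square of sums
    }


# All four input combinations of two binary features.
a = np.array([0, 1, 0, 1])
b = np.array([0, 0, 1, 1])
for name, values in pairwise_combinations(a, b).items():
    print(name, values.tolist())
```

For binary inputs, the sum of squares equals the plain sum, while the square of sums differs only where both features are 1. XOR is the only combination here that is not monotone in its inputs.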