Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems
Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar
TL;DR
This work presents a modular, deterministic framework for generating completely synthetic, production-scale, high-cardinality categorical datasets tailored to evaluating real-life recommender systems, addressing privacy and data-access limitations. The core tool, CategoricalClassification, supports controlled feature generation, target formation, correlations, and data augmentation; it is available as an open-source Python package (catclass) and is integrated with Outrank for AutoML-style feature selection. Through three use cases (benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches), the framework is shown to isolate model behavior, reveal memory-accuracy trade-offs, and expose biases under complex feature interactions. The approach enables reproducible, scenario-driven experiments that support systematic evaluation and development of recommender-system pipelines. Future directions include GAN/VAE-based enrichment and extensions to regression tasks.
Abstract
Synthetic datasets are important for evaluating and testing machine learning models. Evaluations of real-life recommender systems often rely on high-dimensional, sparse categorical datasets. Unfortunately, few existing solutions can generate artificial datasets with such characteristics. To address this, we developed a novel framework for generating synthetic datasets that are diverse and statistically coherent. Our framework allows the creation of datasets with controlled attributes and supports iterative modifications to fit specific experimental needs, such as introducing complex feature interactions, setting feature cardinality, or imposing specific distributions. We demonstrate the framework's utility through use cases such as benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches. Unlike existing methods that either focus narrowly on specific dataset structures or prioritize (private) data synthesis from real data, our approach provides a modular means of quickly generating completely synthetic datasets that can be tailored to diverse experimental requirements. Our results show that the framework effectively isolates model behavior in unique situations, highlighting its potential to advance the evaluation and development of recommender systems. The framework is available as a free, open Python package to facilitate research with minimal friction.
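To make the idea of controlled, deterministic generation concrete, the following is a minimal sketch in plain NumPy. It does not use the catclass API: the function names `generate_categorical` and `interaction_target`, their parameters, and the parity-based target rule are assumptions made purely for illustration. It shows integer-coded categorical features with chosen cardinalities and a binary target that depends on an interaction between two features.

```python
import numpy as np


def generate_categorical(n_samples, cardinalities, seed=0):
    """Draw integer-coded categorical features, one column per cardinality.

    Hypothetical helper for illustration; not part of catclass.
    """
    rng = np.random.default_rng(seed)  # a fixed seed keeps generation deterministic
    columns = [rng.integers(0, card, size=n_samples) for card in cardinalities]
    return np.column_stack(columns)


def interaction_target(X, i, j):
    """Binary target given by the parity of the sum of features i and j.

    The parity rule is an assumed example of a feature interaction.
    """
    return ((X[:, i] + X[:, j]) % 2).astype(np.int64)


# One low-, one medium-, and one high-cardinality feature.
X = generate_categorical(n_samples=1000, cardinalities=[2, 50, 10_000], seed=42)
y = interaction_target(X, 0, 1)
```

Because the target depends only on columns 0 and 1, the high-cardinality third column acts as pure noise in this sketch, which is the kind of controlled setting in which a feature-selection method's behavior can be isolated.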
