Natural Language-Based Synthetic Data Generation for Cluster Analysis
Michael J. Zellinger, Peter Bühlmann
TL;DR
The paper addresses the difficulty of benchmarking clustering algorithms by enabling high-level, interpretable synthetic data generation. It introduces archetype-based generation through the repliclust toolkit: users specify broad geometric properties or describe a scenario in natural language, and these inputs are mapped to concrete mixture-model parameters. A minimax-based overlap framework, together with an LDA-inspired approximation, controls inter-cluster overlap, while optional post-processing (distort and wrap_around_sphere) yields non-convex or directional cluster shapes. Under the hood, the method combines max-min sampling, mixture-model sampling, and optimization-based overlap control. The authors validate the approach with a mock benchmark, show how clustering difficulty varies with overlap, and examine how overlap and dimensionality interact to influence the performance of common algorithms, offering practical insights for benchmarking in high dimensions and with irregular cluster shapes. The open-source implementation, natural-language interface, and JSON-based archetype sharing let researchers describe evaluation scenarios in English and obtain reproducible, explainable data sets. Overall, the work advances interpretable, high-level control of synthetic clustering benchmarks and provides a path toward closer alignment with real-world data characteristics.
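To make the idea of an LDA-inspired overlap measure concrete, here is a minimal sketch in Python. It is not taken from repliclust: the function name lda_overlap, the restriction to diagonal covariances, and the convention that overlap equals twice the minimax misclassification rate along the discriminant axis are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def lda_overlap(mean_a, var_a, mean_b, var_b):
    """Approximate overlap between two Gaussian clusters (hypothetical helper).

    Each cluster is given by its mean and its per-coordinate variances
    (diagonal covariance, an assumption made to keep the sketch short).
    """
    # LDA-style axis: inverse of the averaged covariance times the mean difference.
    diff = [mb - ma for ma, mb in zip(mean_a, mean_b)]
    axis = [d / (0.5 * (va + vb)) for d, va, vb in zip(diff, var_a, var_b)]

    # Standard deviation of each cluster along the axis, and the gap between centers.
    sd_a = math.sqrt(sum(w * w * va for w, va in zip(axis, var_a)))
    sd_b = math.sqrt(sum(w * w * vb for w, vb in zip(axis, var_b)))
    gap = abs(sum(w * d for w, d in zip(axis, diff)))

    # Coinciding centers: the clusters cannot be separated along any axis.
    if sd_a + sd_b == 0.0:
        return 1.0

    # The minimax threshold lies q standard deviations from both centers,
    # so both clusters are misclassified at the same rate 1 - Phi(q).
    q = gap / (sd_a + sd_b)
    return 2.0 * (1.0 - NormalDist().cdf(q))


if __name__ == "__main__":
    # Two unit-variance clusters whose centers are 4 units apart.
    print(lda_overlap([0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [1.0, 1.0]))
```

Under these assumptions, moving the centers apart or shrinking the variances along the axis lowers the overlap, which is the quantity an optimization-based generator would adjust until it falls within a target range.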
Abstract
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the variation in cluster shapes, can be effectively varied. Unfortunately, creating evaluation scenarios is often laborious, as practitioners must translate higher-level scenario descriptions like "clusters with very different shapes" into lower-level geometric parameters such as cluster centers, covariance matrices, etc. To make benchmarks more convenient and informative, we propose synthetic data generation based on direct specification of high-level scenarios, either through verbal descriptions or high-level geometric parameters. Our open-source Python package repliclust implements this workflow, making it easy to set up interpretable and reproducible benchmarks for cluster analysis. A demo of data generation from verbal inputs is available at https://demo.repliclust.org.
