Tidy simulation: Designing robust, reproducible, and scalable Monte Carlo simulations
Erik-Jan van Kesteren
TL;DR
The paper addresses the challenge of implementing Monte Carlo simulations in a robust, reproducible, and scalable way. It proposes tidy simulation, a framework that organizes a simulation around four components (a tidy simulation grid, a data-generation function, an analysis function, and a results table) so that implementations follow the same structure across languages and hardware. The approach enables embarrassingly parallel execution, supports on-disk and lazy data structures for large grids, and integrates with common data analysis and visualization tools. It also provides practical guidance on ADEMP design (aims, data-generating mechanisms, estimands, methods, performance measures), modular data generation and analysis, and scaling strategies, along with open-source templates in Python and R.
Abstract
Monte Carlo simulation studies are at the core of the modern applied, computational, and theoretical statistical literature. Simulation is a broadly applicable research tool, used to collect data on the relative performance of methods or data analysis approaches under a well-defined data-generating process. However, extant literature focuses largely on design aspects of simulation, rather than implementation strategies aligned with the current state of (statistical) programming languages, portable data formats, and multi-node cluster computing. In this work, I propose tidy simulation: a simple, language-agnostic, yet flexible functional framework for designing, writing, and running simulation studies. It has four components: a tidy simulation grid, a data generation function, an analysis function, and a results table. Using this structure, even the smallest simulations can be written in a consistent, modular way, yet they can be readily scaled to thousands of nodes in a computer cluster should the need arise. Tidy simulation also supports the iterative, sometimes exploratory nature of simulation-based experiments. By adopting the tidy simulation approach, researchers can implement their simulations in a robust, reproducible, and scalable way, which contributes to high-quality statistical science.
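To make the four components concrete, the following is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the paper's own template: the function names (make_grid, generate_data, analyze, run_row), the toy data-generating process (normal data with a shifted mean), and the seeding scheme are all hypothetical choices made for this example.

```python
import itertools
import random
import statistics


def make_grid(sample_sizes, effect_sizes, n_reps):
    """Tidy simulation grid: one row (dict) per condition and replication."""
    combos = itertools.product(sample_sizes, effect_sizes, range(n_reps))
    return [
        {"row_id": i, "n": n, "effect": effect, "rep": rep}
        for i, (n, effect, rep) in enumerate(combos)
    ]


def generate_data(n, effect, seed):
    """Data-generation function: parameters in, one dataset out.

    Toy process (an assumption for this sketch): n draws from N(effect, 1).
    """
    rng = random.Random(seed)  # per-row generator keeps rows independent
    return [rng.gauss(effect, 1.0) for _ in range(n)]


def analyze(data):
    """Analysis function: dataset in, one record of estimates out."""
    n = len(data)
    return {
        "estimate": statistics.fmean(data),
        "se": statistics.stdev(data) / n ** 0.5,
    }


def run_row(row, base_seed=2024):
    """Generate and analyze the data for a single grid row."""
    seed = base_seed + row["row_id"]  # hypothetical seeding scheme
    data = generate_data(row["n"], row["effect"], seed)
    return {**row, **analyze(data)}  # grid columns plus result columns


# Results table: one record per grid row.
grid = make_grid(sample_sizes=[20, 100], effect_sizes=[0.0, 0.5], n_reps=50)
results = [run_row(row) for row in grid]
```

Because run_row depends only on its own grid row, the final list comprehension could in principle be swapped for a parallel map (for example, multiprocessing.Pool.map) without changing any of the other functions. This is the sense in which such a structure is embarrassingly parallel.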
