Tidy simulation: Designing robust, reproducible, and scalable Monte Carlo simulations

Erik-Jan van Kesteren

TL;DR

The paper addresses the challenge of implementing Monte Carlo simulations in a robust, reproducible, and scalable way. It proposes tidy simulation, a framework organizing simulations around a tidy grid, a data-generation function, an analysis function, and a results table to standardize implementation across languages and hardware. The approach enables embarrassingly parallel execution, supports on-disk and lazy data structures for large grids, and integrates with common data analysis and visualization tools. It provides practical guidance, including ADEMP design, modular data generation and analysis, scaling strategies, and open-source templates in Python and R.

Abstract

Monte Carlo simulation studies are at the core of the modern applied, computational, and theoretical statistical literature. Simulation is a broadly applicable research tool, used to collect data on the relative performance of methods or data analysis approaches under a well-defined data-generating process. However, extant literature focuses largely on design aspects of simulation, rather than implementation strategies aligned with the current state of (statistical) programming languages, portable data formats, and multi-node cluster computing. In this work, I propose tidy simulation: a simple, language-agnostic, yet flexible functional framework for designing, writing, and running simulation studies. It has four components: a tidy simulation grid, a data generation function, an analysis function, and a results table. Using this structure, even the smallest simulations can be written in a consistent, modular way, yet they can be readily scaled to thousands of nodes in a computer cluster should the need arise. Tidy simulation also supports the iterative, sometimes exploratory nature of simulation-based experiments. By adopting the tidy simulation approach, researchers can implement their simulations in a robust, reproducible, and scalable way, which contributes to high-quality statistical science.
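The four components named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not taken from the paper's templates: the factor names (`n`, `effect`), the seed scheme, the function names, and the mean-difference analysis are all assumptions chosen for illustration.

```python
import itertools
import random
import statistics

# Hypothetical simulation factors, chosen only for illustration.
SAMPLE_SIZES = [20, 50]
EFFECT_SIZES = [0.0, 0.5]
N_REPLICATIONS = 3


def make_grid():
    """Simulation grid: one row (a dict) per condition and replication."""
    combos = itertools.product(SAMPLE_SIZES, EFFECT_SIZES, range(N_REPLICATIONS))
    return [
        # The per-row seed is an assumed convention that makes each row reproducible.
        {"row_id": i, "n": n, "effect": effect, "rep": rep, "seed": 1000 + i}
        for i, (n, effect, rep) in enumerate(combos)
    ]


def generate_data(row):
    """Data generation function: sample two groups from the row's settings."""
    rng = random.Random(row["seed"])
    control = [rng.gauss(0.0, 1.0) for _ in range(row["n"])]
    treated = [rng.gauss(row["effect"], 1.0) for _ in range(row["n"])]
    return control, treated


def analyze(data):
    """Analysis function: estimate the difference in group means."""
    control, treated = data
    return {"estimate": statistics.fmean(treated) - statistics.fmean(control)}


def run_simulation(grid):
    """Results table: each grid row extended with its analysis output."""
    return [{**row, **analyze(generate_data(row))} for row in grid]


grid = make_grid()
results = run_simulation(grid)
```

Because every row carries its own settings and seed, rerunning `run_simulation(grid)` yields the same results table, and the list of dicts can be loaded into a data frame for summarizing and plotting.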

Paper Structure

This paper contains 13 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Four sequential components of a tidy simulation. The simulation grid is a tidy data frame with the settings for each iteration of the simulation, the generation function creates simulated data through random sampling, the analysis function applies the methods of interest to this data, and the results table compiles the outputs from the analysis, again in a tidy data format.
  • Figure 2: Tidy simulations are embarrassingly parallel, as each row in the simulation grid and results table represents a single simulation setting (tidy data), meaning it is independent of the others.
  • Figure 3: Result of the example simulation, showing that the uncorrected post-measurement outcome analysis does not perform well in terms of power, and the other three methods have similar power for low and medium effect sizes. With a large effect size, the uncorrected change-score analysis has the highest power to detect the treatment effect. Note that these conclusions hold only for the specific data-generating process under investigation in the example simulation.
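The row independence described in the Figure 2 caption can be sketched as follows. This is an illustrative example rather than the paper's implementation: the grid contents, the seed scheme, and the `run_row` helper are hypothetical. A thread pool keeps the example portable; CPU-bound simulations would more typically use a process pool or separate cluster jobs, which the same row-wise structure permits.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

# Hypothetical grid: one dict per row, each carrying its own seed.
grid = [
    {"row_id": i, "n": n, "seed": 2000 + i}
    for i, n in enumerate([10, 10, 100, 100, 1000, 1000])
]


def run_row(row):
    """Generate and analyze data for one grid row, using only that row."""
    rng = random.Random(row["seed"])  # row-local generator, no shared state
    sample = [rng.gauss(0.0, 1.0) for _ in range(row["n"])]
    return {**row, "mean": statistics.fmean(sample)}


# Sequential reference run.
sequential = [run_row(row) for row in grid]

# Concurrent run: rows are handed to workers independently.
# pool.map returns results in the order of the input rows.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(run_row, grid))

# Rows share no state, so both runs produce the same results table.
same = parallel == sequential
```

Since no row depends on another, the grid could equally be split into chunks and each chunk processed by a different worker or node, with the partial results tables concatenated afterwards.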