Generation of Multivariate Discrete Data with Generalized Poisson, Negative Binomial and Binomial Marginal Distributions

Chak Kwong Cheng, Hakan Demirtas

TL;DR

This work introduces a flexible framework to generate multivariate discrete data with marginal distributions from the generalized Poisson, negative binomial, and binomial families, requiring only marginals and a feasible correlation matrix rather than a full joint model. It extends Demirtas's ordinal-data generator by collapsing marginals to binary indicators, calibrating intermediate correlations, and reverse-collapsing to the original scales, thereby enabling a wide range of correlation structures. The method is validated through extensive simulations across four scenarios and demonstrated on three real datasets, showing accurate parameter recovery, controlled biases, and robust coverage; an accompanying R package, MultiDiscreteRNG, implements the framework and supports mixed marginals. The approach offers a practical, scalable tool for simulating realistic, correlated discrete data with broad applicability in epidemiology, social science, genomics, and environmental studies.

Abstract

The analysis of multivariate discrete data is crucial in various scientific research areas, such as epidemiology, the social sciences, genomics, and environmental studies. As the availability of such data increases, developing robust analytical and data generation tools is necessary to understand the relationships among variables. This paper builds upon previous work on data generation frameworks for multivariate ordinal data with a prespecified correlation matrix. The proposed algorithm generates multivariate discrete data from marginal distributions that follow the generalized Poisson, negative binomial, and binomial distributions. A step-by-step algorithm is provided, and its performance is illustrated in four simulated data scenarios and three real-data scenarios. This technique has the potential to be applied in a wide range of settings involving the generation of correlated discrete data.
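
To make the idea of "marginals plus a correlation target" concrete, the following is a minimal sketch that draws correlated pairs with one binomial and one negative binomial marginal. It is not the collapse and reverse-collapse algorithm described in the paper, and it does not use the MultiDiscreteRNG package; it substitutes a plain Gaussian-copula (inverse-CDF) construction as a stand-in. All function names, parameter values, and the latent correlation of 0.6 are illustrative assumptions. The sketch also shows why a calibration step is needed: the correlation observed in the discrete output generally differs from the latent normal correlation that was fed in.

```python
import math
import random
from statistics import NormalDist


def binomial_quantile(u, n, p):
    """Smallest k with P(X <= k) >= u for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= u:
            return k
    return n  # guard against floating-point shortfall in the CDF


def negbin_quantile(u, r, p, k_max=10_000):
    """Smallest k with P(X <= k) >= u, where X counts failures before
    the r-th success (integer r, success probability p)."""
    cdf = 0.0
    for k in range(k_max + 1):
        cdf += math.comb(k + r - 1, k) * p**r * (1 - p) ** k
        if cdf >= u:
            return k
    return k_max  # hypothetical truncation point, an assumption of this sketch


def sample_pairs(n_samples, latent_rho, seed=0):
    """Draw (binomial, negative binomial) pairs through a Gaussian copula.

    latent_rho is the correlation of the underlying normals, NOT the
    correlation of the discrete output; no calibration is done here.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = latent_rho * z1 + math.sqrt(1 - latent_rho**2) * rng.gauss(0, 1)
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        x = binomial_quantile(u1, n=10, p=0.3)   # illustrative marginal
        y = negbin_quantile(u2, r=5, p=0.5)      # illustrative marginal
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    data = sample_pairs(2000, latent_rho=0.6)
    print("achieved correlation:", round(pearson(data), 3))
```

Comparing the printed correlation with the latent value of 0.6 illustrates the gap that an intermediate-correlation calibration step, such as the one the paper describes, is meant to close.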

Paper Structure (20 sections, 16 equations, 11 tables)