Tree Pólya Splitting distributions for multivariate count data
Samuel Valiquette, Jean Peyhardi, Éric Marchand, Gwladys Toulemonde, Frédéric Mortier
TL;DR
This work addresses the modeling of multivariate count data with flexible dependence by introducing Tree Pólya Splitting, a generalization of Pólya Splitting that imposes a fixed partition tree and recursively splits totals along its nodes. The approach yields tractable univariate marginals, factorial moments, and rich covariance structures that can exhibit positive, negative, or zero correlations, while preserving simple, divide-and-conquer inference. The authors derive detailed properties (marginals, factorial moments, covariance, correlation, and log-likelihood decomposition) and illustrate the method on a Trichoptera abundance dataset, showing competitive or superior performance relative to Poisson-lognormal and standard splitting models, with substantially fewer parameters. The framework unifies several existing multivariate discrete distributions and provides a practical path for learning tree structures from data, enabling flexible yet interpretable modeling of complex dependence in multivariate counts.
Abstract
In this article, we develop a new class of multivariate distributions for count data, called Tree Pólya Splitting. This class results from combining a univariate distribution with singular multivariate distributions along a fixed partition tree. Known distributions, including the Dirichlet-multinomial, the generalized Dirichlet-multinomial, and the Dirichlet-tree multinomial, are particular cases within this class. As we demonstrate, these distributions are flexible and can model complex dependence structures (positive, negative, or null) at the observation level. Specifically, we present the theoretical properties of Tree Pólya Splitting distributions, focusing primarily on marginal distributions, factorial moments, and dependence structures (covariance and correlations). A Trichoptera abundance dataset is used, on the one hand, as a benchmark to illustrate the theoretical properties developed in this article and, on the other hand, to demonstrate the value of these models, notably by comparing them with other approaches for fitting multivariate data, such as the Poisson-lognormal model in ecology or the singular multivariate distributions used in microbiome studies.
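The construction described in the abstract (a univariate total split recursively along a fixed partition tree) can be illustrated with a minimal sampling sketch. It rests on several assumptions that are not taken from the article: the total is Poisson, every internal node uses a Dirichlet-multinomial split (one possible choice of singular distribution), and the tree `TREE`, its parameter values, and all function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def dirichlet_multinomial(n, alphas, rng):
    """Split the integer n into len(alphas) non-negative counts summing to n."""
    # Dirichlet weights obtained by normalizing independent gamma variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    counts = [0] * len(alphas)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[j] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts


def split(n, node, rng, out):
    """Recursively split the total n along the tree, writing leaf counts to out."""
    if isinstance(node, str):  # a leaf is identified by its name
        out[node] = n
        return
    children, alphas = node  # an internal node: (children, split parameters)
    for child, m in zip(children, dirichlet_multinomial(n, alphas, rng)):
        split(m, child, rng, out)


def sample_tree_polya(lam, tree, rng):
    """Draw the total (assumed Poisson here), then split it along the tree."""
    out = {}
    split(poisson(lam, rng), tree, rng, out)
    return out


# Hypothetical tree: the root splits into sp1 and an inner node {sp2, sp3}.
TREE = (["sp1", (["sp2", "sp3"], [1.5, 0.5])], [2.0, 3.0])

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_tree_polya(6.0, TREE, rng))
```

Each draw is a dictionary of leaf counts whose sum equals the sampled total. With a single internal node (the root), this sketch reduces to an ordinary Dirichlet-multinomial split of the total; deeper trees let different groups of leaves share or separate their variability, which is the mechanism behind the varied dependence structures the abstract mentions.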
