Multivariate Species Sampling Models
Beatrice Franzolini, Antonio Lijoi, Igor Prünster, Giovanni Rebaudo
TL;DR
The paper addresses the rigidity of exchangeability in multi-population data by introducing multivariate species sampling processes (mSSP), a unifying framework for dependent nonparametric priors under partial exchangeability. It develops the theory of their dependence structure through the partially exchangeable partition probability function (pEPPF) and its predictive mechanism (mgCRP), focusing on the regular subclass (rmSSP), which captures most Bayesian nonparametric (BNP) models used in practice. The authors show that borrowing of information across populations is governed entirely by shared ties, so the learning mechanism can be read off the observable tie structure, and they provide constructive representations for building new models. The framework is demonstrated on real tree data and on synthetic experiments in a multi-armed bandit setting, which show improved species-discovery performance when information sharing is appropriate and robustness when it is not. Open-source code is available for replication.
Abstract
Species sampling processes have long served as the fundamental framework for modeling random discrete distributions and exchangeable sequences. However, data arising from distinct but related sources require a broader notion of probabilistic invariance, making partial exchangeability a natural choice. Countless models for partially exchangeable data, collectively known as dependent nonparametric priors, have been proposed. These include hierarchical, nested and additive processes, widely used in statistics and machine learning. Still, a unifying framework is lacking and key questions about their underlying learning mechanisms remain unanswered. We fill this gap by introducing multivariate species sampling models, a new general class of nonparametric priors that encompasses most existing finite- and infinite-dimensional dependent processes. They are characterized by the induced partially exchangeable partition probability function encoding their multivariate clustering structure. We establish their core distributional properties and analyze their dependence structure, demonstrating that borrowing of information across groups is entirely determined by shared ties. This provides new insights into the underlying learning mechanisms, offering, for instance, a principled rationale for the previously unexplained correlation structure observed in existing models. Beyond providing a cohesive theoretical foundation, our approach serves as a constructive tool for developing new models and opens novel research directions for capturing richer dependence structures beyond the framework of multivariate species sampling processes.
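To make the notion of shared ties concrete, the following is a minimal sketch that simulates species labels for several groups from a Chinese restaurant franchise, the standard predictive scheme of the hierarchical Dirichlet process, which is one of the hierarchical processes the abstract refers to. It is an illustration under stated assumptions, not the paper's mgCRP or its released code: the function names `sample_crf` and `shared_species` and the concentration parameters `alpha` and `gamma` are hypothetical choices made for this example.

```python
import random


def sample_crf(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    """Draw species labels for each group from a Chinese restaurant franchise.

    Assumption: a hierarchical Dirichlet process with group-level
    concentration `alpha` and top-level concentration `gamma`.
    Returns one list of integer species labels per group.
    """
    rng = random.Random(seed)
    dish_tables = []  # dish_tables[k] = number of tables, over all groups, serving species k
    samples = []
    for n in group_sizes:
        table_counts = []  # customers seated at each table of this group
        table_dish = []    # species served at each table of this group
        labels = []
        for _ in range(n):
            # Join an existing table with weight equal to its size,
            # or open a new table with weight alpha.
            weights = table_counts + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_counts):
                # A new table draws its species from the top-level urn, which
                # is shared by all groups: an existing species with weight equal
                # to its table count, or a brand-new species with weight gamma.
                dish_weights = dish_tables + [gamma]
                k = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if k == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_counts.append(1)
                table_dish.append(k)
            else:
                table_counts[t] += 1
            labels.append(table_dish[t])
        samples.append(labels)
    return samples


def shared_species(samples):
    """Species observed in every group, i.e. the ties shared across groups."""
    return set.intersection(*(set(s) for s in samples))


if __name__ == "__main__":
    draws = sample_crf([10, 10], alpha=1.0, gamma=1.0, seed=1)
    print("group 1:", draws[0])
    print("group 2:", draws[1])
    print("shared species:", sorted(shared_species(draws)))
```

In this sketch the top-level urn is the only channel connecting the groups: a species can appear in a second group only if a new table there reuses a label already created at the top level. The set returned by `shared_species` is therefore the observable trace of cross-group dependence in this particular model.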
