Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with Double Power-law Behavior

Fadhel Ayed, Juho Lee, François Caron

TL;DR

This work introduces a novel class of doubly regularly-varying CRMs whose normalized forms produce random partitions with a two-regime power-law in frequencies, addressing empirical evidence of double power-law behavior in language and networks. It develops two concrete instantiations, the generalized BFRY (GBFRY) and beta prime (BP) processes, and provides scalable posterior inference via MCMC augmented with latent variables to handle intractable likelihood components. The authors demonstrate that these models fit large-frequency data better than the Pitman–Yor process and related normalized CRMs across synthetic and real datasets, including word frequencies and Twitter networks. The results offer a flexible, theoretically grounded approach to modeling two-regime power-laws with potential extensions to hierarchical or graph-based Bayesian nonparametrics, advancing both theory and practical modeling of heavy-tailed phenomena.

Abstract

Bayesian nonparametric approaches, in particular the Pitman-Yor process and the associated two-parameter Chinese Restaurant process, have been successfully used in applications where the data exhibit power-law behavior. Examples include natural language processing, natural images, and networks. There is also growing empirical evidence that some datasets exhibit a two-regime power-law behavior: one regime for small frequencies, and a second regime, with a different exponent, for high frequencies. In this paper, we introduce a class of completely random measures which are doubly regularly-varying. Contrary to the Pitman-Yor process, we show that when completely random measures in this class are normalized to obtain random probability measures and associated random partitions, such partitions exhibit a double power-law behavior. We discuss in particular three models within this class: the beta prime process (Broderick et al., 2015, 2018), a novel process called the generalized BFRY process, and a mixture construction. We derive efficient Markov chain Monte Carlo algorithms to estimate the parameters of these models. Finally, we show that the proposed models provide a better fit than the Pitman-Yor process on various datasets.
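For readers unfamiliar with the baseline that the abstract contrasts against, the following is a minimal sketch of the standard two-parameter Chinese Restaurant process seating rule, which generates the random partitions associated with the Pitman-Yor process. It is not code from the paper: the function names `sample_crp` and `size_proportions`, and the parameter names `discount` and `concentration`, are chosen here for illustration only. A customer joins an existing cluster of size $n_k$ with probability proportional to $n_k - \text{discount}$ and opens a new cluster with probability proportional to $\text{concentration} + \text{discount} \cdot K$, where $K$ is the current number of clusters.

```python
import random
from collections import Counter


def sample_crp(n, discount, concentration, seed=0):
    """Return cluster sizes after seating n customers (two-parameter CRP).

    Parameter names are illustrative assumptions, not taken from the paper.
    """
    if not 0 <= discount < 1 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    sizes = []  # sizes[k] is the number of customers in cluster k
    for seated in range(n):
        # Unnormalized mass for opening a new cluster.
        new_mass = concentration + discount * len(sizes)
        u = rng.random() * (seated + concentration)
        if not sizes or u < new_mass:
            sizes.append(1)
            continue
        u -= new_mass
        for k, size in enumerate(sizes):
            u -= size - discount  # mass of existing cluster k
            if u < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def size_proportions(sizes):
    """Proportion of clusters having each size (the quantity plotted in Figure 1, bottom)."""
    counts = Counter(sizes)
    total = len(sizes)
    return {j: c / total for j, c in sorted(counts.items())}


if __name__ == "__main__":
    sizes = sample_crp(5000, discount=0.5, concentration=1.0)
    props = size_proportions(sizes)
    for j in list(props)[:5]:
        print(j, round(props[j], 3))
```

Plotting the output of `size_proportions` on log-log axes gives the kind of single-exponent curve that the paper argues is too restrictive for datasets with two regimes. The inner loop is linear in the number of clusters, which is adequate for a demonstration but not for the dataset sizes discussed in the paper.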

Paper Structure

This paper contains 47 sections, 10 theorems, 101 equations, 10 figures, 2 tables.

Key Result

Proposition 1

A CRM that is regularly varying at 0 with exponent $\alpha>0$ satisfies [...], where $\ell_1^*$ is a slowly varying function whose expression, which depends on $\ell_1$ and $\alpha$, is given in the proof of this proposition.

Figures (10)

  • Figure 1: (Top) Ranked word frequencies from the American National Corpus (circles) and power-law fit (straight lines). (Bottom) Proportion of words with a given number of occurrences for the same dataset (circles) and power-law fit (straight lines).
  • Figure 2: Simulated data from the normalized GBFRY model. Proportion of clusters of a given size for (First) $\eta=4000,\tau=3,\sigma=0.2$ with varying $n$ and (Second) $\eta=4000,\tau=3,n=10^7$ with varying $\sigma$. Ordered frequencies, normalized by the largest one, for (Third) $n=10^6,\sigma=0.2,\tau=3$ with varying $\eta$ and (Fourth) $n=10^6,\sigma=0.2,\eta=50000$ with varying $\tau$.
  • Figure 3: Results on the ANC dataset: $95\%$ credible interval of the posterior predictive in blue, data in red. (Top) Proportion of occurrences of a given size. (Bottom) Ranked occurrences.
  • Figure 4: Trace plots of the parameter samples for the Generalized BFRY model. Dashed line represents true value of the parameter.
  • Figure 5: Trace plots of the parameter samples for the Beta prime process model. Dashed line represents true value of the parameter.
  • ...and 5 more figures

Theorems & Definitions (13)

  • Definition 3.1: Slowly varying function
  • Definition 3.2: Regularly varying function
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Corollary 2
  • Remark 1
  • Theorem 3: Karamata's theorem
  • Corollary 4
  • ...and 3 more
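
For orientation, the two definitions named at the top of this list are usually stated as follows in the regular variation literature. This is the textbook formulation, given here as an assumption about what Definitions 3.1 and 3.2 contain; the paper's exact wording and notation may differ. A positive measurable function $\ell$ on $(0,\infty)$ is slowly varying at infinity if, for every $c>0$,

$$\lim_{x\to\infty} \frac{\ell(cx)}{\ell(x)} = 1,$$

and a function $f$ is regularly varying at infinity with exponent $\rho\in\mathbb{R}$ if

$$f(x) = x^{\rho}\,\ell(x)$$

for some slowly varying function $\ell$. Regular variation at 0 is defined analogously by requiring $x \mapsto f(1/x)$ to be regularly varying at infinity. Constants and logarithms are standard examples of slowly varying functions, so a regularly varying function behaves like a power law up to such slowly changing corrections.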