Universalities in the Avalanche Dynamics of Novelties and Non-Novelties

Filippo Santoro, Alberto Petri, Francesca Tria

TL;DR

This work introduces avalanche statistics as a central diagnostic for innovation dynamics, extending beyond Heaps', Zipf's, and Taylor's laws to characterize sequences of novelties and non-novelties. Using urn-based models (UMT and its exchangeable variant UMT-E, plus the semantically enhanced UMST), the authors derive analytical expressions for the Heaps law beyond its asymptotic regime and for the avalanche-size distributions, together with a scaling relation that collapses inter-event statistics across diverse real-world datasets. The results point to a form of universality in novelty dynamics across several social and cultural systems, while also showing that heavy-tailed inter-event times in corpora such as Gutenberg and Wikipedia reflect a superposition of topic-driven dynamics rather than a single collective process. These insights connect fundamental statistical laws to the micro-dynamics of novelty production and provide analytical tools to distinguish systems that behave as a single stochastic process from those that arise from a superposition of several.

Abstract

Unprecedented events intertwine with the repetition of the past in natural phenomena and human activities. Key statistical patterns, such as Heaps' and Taylor's laws and Zipf's law, have been identified as characterizing the dynamical processes that govern the emergence of novelties and the abundance of repeated elements. Observing these statistical regularities has been pivotal in motivating the search for modeling schemes that can explain them and clarify key mechanisms underlying the appearance of new elements and their subsequent recurrence. In this study, we analyze sequences of novel and non-novel elements, referred to as avalanches, in real-world systems. We show that avalanche statistics provide a complementary characterization of innovation dynamics, extending beyond the three fundamental laws mentioned above. Although arising from collective dynamics, some systems behave as a single instance of a stochastic process. Others, such as natural language, exhibit features that we can only explain by a superposition of different dynamics. This distinction is not apparent when considering Heaps' law alone, while it clearly emerges in the avalanche statistics. By interpreting these empirical observations, we also advance the theoretical understanding of urn-based models that successfully reproduce the observed behaviors associated with Heaps', Zipf's, and Taylor's laws. We derive analytical expressions that accurately describe the probability distributions of avalanches and the Heaps law beyond its asymptotic regime. Building on these results, we derive a scaling relation that we show also holds in real-world systems, indicating a form of universality in the dynamics of novelty.

Paper Structure

This paper contains 10 sections, 15 equations, 11 figures.

Figures (11)

  • Figure 1: Cartoon of a sequence $S$ of events (flowing from left to right) where we highlight avalanches of novelties and avalanches of non-novelties. Top: colored balls encircled with dotted lines represent novelties, i.e., elements that have never appeared in the sequence until that moment; colored balls with no contour line represent non-novelties, i.e., elements already present. Each color represents a distinct element that, depending on the context, can be a word, a song, an edit, a hashtag, etc. Bottom: same as the top, but we only characterize elements by being a novelty (white balls) or a non-novelty (red balls).
  • Figure 2: Avalanche size distribution for novelties. We show results for the empirical datasets, contrasting them with predictions from the UMT-E model, as obtained through analytical computation and numerical simulations. The parameters used for the UMT-E model predictions are those obtained from the fit of the Heaps law in the empirical datasets, and we report them in brackets for each dataset. a) Github Users $(\nu=4,\rho=7,N_{0}=45468)$. b) Github Repositories $(\nu=22,\rho=29,N_{0}=14640)$. c) Twitter Users $(\nu=16,\rho=29,N_{0}=42183043)$. d) Twitter Hashtags $(\nu=19,\rho=22,N_{0}=1995)$. e) LastFM $(\nu=28,\rho=41,N_{0}=27200)$. f) The Wikipedia Corpus $(\nu=31,\rho=50,N_{0}=25000)$. g) The Gutenberg Corpus $(\nu=9,\rho=19,N_{0}=5890)$. We have truncated the sequences of both Gutenberg and Wikipedia at length $2 \cdot 10^{7}$.
  • Figure 3: Avalanche size distribution for non-novelties. We show results for the empirical datasets, contrasting them with predictions from the UMT-E model, as obtained through analytical computation and numerical simulations. The parameters used for the UMT-E model predictions are those obtained from the fit of the Heaps law in the empirical datasets, and we report them in brackets for each dataset. a) Github Users $(\nu=4,\rho=7,N_{0}=45468)$. b) Github Repositories $(\nu=22,\rho=29,N_{0}=14640)$. c) Twitter Users $(\nu=16,\rho=29,N_{0}=42183043)$. d) Twitter Hashtags $(\nu=19,\rho=22,N_{0}=1995)$. e) LastFM $(\nu=28,\rho=41,N_{0}=27200)$. f) The Wikipedia Corpus $(\nu=31,\rho=50,N_{0}=25000)$. g) The Gutenberg Corpus $(\nu=9,\rho=19,N_{0}=5890)$. We have truncated the sequences of both Gutenberg and Wikipedia at length $2 \cdot 10^{7}$.
  • Figure 4: Avalanche size distribution for non-novelties in the natural language corpora. Top: We contrast results for the empirical datasets with predictions from the UMT-E model obtained through analytical computation and numerical simulations. The parameters used for the UMT-E model predictions, reported in brackets, are those obtained from the fit of the Heaps law in the corresponding empirical data: a) A single Wikipedia page $(\nu=4,\rho=7 , N_{0}=45468)$; b) a single book $(\nu=50,\rho=97, N_{0}=22912)$. Center: We contrast results for the empirical datasets, i.e., c) the Wikipedia corpus and d) the Gutenberg corpus, with predictions from numerical simulations of multiple realizations of the UMT-E model, stacked one after the other, as described in the main text. In the case of Wikipedia, we have considered sequences with more than 500 words. Bottom: We contrast results for the empirical datasets with predictions from the UMT-E model obtained through analytical computation and numerical simulations. The parameters used for the UMT-E model predictions, reported in brackets, are those obtained from the fit of the avalanche size distribution for non-novelties in the corresponding empirical data: e) The Wikipedia corpus $(\nu=2,\rho=9, N_{0}=400000)$; f) a single book $(\nu=23,\rho=73, N_{0}=127020)$.
  • Figure 5: Cartoon of the UMT process and its exchangeable version, UMT-E. a) The last extracted element (the cyan ball) has already appeared in the sequence, i.e., it is a non-novelty. Only the reinforcement process occurs: we insert $\rho$ copies of the cyan ball into the urn in both models. b) The last extracted element (the lime ball) has never appeared in the sequence, i.e., it is a novelty. In this case, both reinforcement and triggering processes occur. The triggering process is identical for both models: we insert $\nu+1$ new distinct elements, whose colors were not present in the urn, into it. Conversely, the reinforcement process differentiates the two models: in UMT, we introduce $\rho$ copies of the last appended ball, while in UMT-E, we only introduce $\tilde{\rho}=\rho-(\nu+1)$ copies of it. Note that in the UMT-E model, we introduce the same number of balls in the urn at each step, independently of the last extracted ball.
  • ...and 6 more figures
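
The update rules in the Figure 5 caption and the avalanche picture in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the urn starts with $N_{0}$ distinct balls, that balls are drawn uniformly at random, that the drawn ball stays in the urn, and that an avalanche is a maximal run of consecutive novelties (or non-novelties). The function names `simulate_urn`, `novelty_flags`, and `avalanche_sizes` are hypothetical, and the demo parameters are toy values rather than any fit reported above.

```python
import random
from itertools import groupby


def simulate_urn(steps, nu, rho, n0, exchangeable=True, seed=None):
    """Simulate UMT (exchangeable=False) or UMT-E (exchangeable=True).

    Assumptions of this sketch: the urn starts with n0 distinct balls,
    draws are uniform, and the drawn ball is left in the urn.
    Returns the sequence of drawn elements and the final urn contents.
    """
    if exchangeable and rho < nu + 1:
        raise ValueError("UMT-E needs rho >= nu + 1 so that rho_tilde >= 0")
    rng = random.Random(seed)
    urn = list(range(n0))        # one ball per initial colour
    next_colour = n0             # label for the next brand-new colour
    seen = set()                 # elements already in the sequence
    sequence = []
    for _ in range(steps):
        ball = rng.choice(urn)
        sequence.append(ball)
        if ball not in seen:
            # Novelty: triggering adds nu + 1 new distinct colours.
            seen.add(ball)
            urn.extend(range(next_colour, next_colour + nu + 1))
            next_colour += nu + 1
            # Reinforcement: rho copies in UMT, rho - (nu + 1) in UMT-E.
            copies = rho - (nu + 1) if exchangeable else rho
        else:
            # Non-novelty: reinforcement only, rho copies in both models.
            copies = rho
        urn.extend([ball] * copies)
    return sequence, urn


def novelty_flags(sequence):
    """Return True at each position where an element appears for the first time."""
    seen = set()
    flags = []
    for element in sequence:
        flags.append(element not in seen)
        seen.add(element)
    return flags


def avalanche_sizes(flags):
    """Split flags into maximal runs and return (novelty sizes, non-novelty sizes)."""
    novel, non_novel = [], []
    for is_new, run in groupby(flags):
        (novel if is_new else non_novel).append(sum(1 for _ in run))
    return novel, non_novel


if __name__ == "__main__":
    # Toy parameters chosen only for illustration.
    seq, urn = simulate_urn(steps=10_000, nu=4, rho=7, n0=50, seed=0)
    novel, non_novel = avalanche_sizes(novelty_flags(seq))
    print("distinct elements:", len(set(seq)))
    print("largest novelty avalanche:", max(novel))
    print("largest non-novelty avalanche:", max(non_novel))
```

In the UMT-E branch the urn grows by exactly $\rho$ balls per step, matching the note at the end of the Figure 5 caption. The same `novelty_flags` and `avalanche_sizes` helpers can be applied to any empirical sequence of hashable items (words, songs, hashtags) to build the histograms that Figures 2 to 4 compare against the model.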