The dynamics of higher-order novelties

Gabriele Di Bona, Alessandro Bellina, Giordano De Marzo, Angelo Petralia, Iacopo Iacopini, Vito Latora

TL;DR

Higher-order novelties are defined as the first time two or more elements appear together, and higher-order Heaps’ exponents are introduced as a way to characterize their pace of discovery.

Abstract

Studying how we explore the world in search of novelties is key to understanding the mechanisms that can lead to new discoveries. Previous studies analyzed novelties in various exploration processes, defining them as the first appearance of an element. However, novelties can also be generated by combining what is already known. We hence define higher-order novelties as the first time two or more elements appear together, and we introduce higher-order Heaps' exponents as a way to characterize their pace of discovery. Through extensive analysis of real-world data, we find that processes with the same pace of discovery, as measured by the standard Heaps' exponent, can instead differ at higher orders. We then propose to model an exploration process as a random walk on a network in which the possible connections between elements evolve in time. The model reproduces the empirical properties of higher-order novelties, revealing how the network we explore changes over time along with the exploration process.
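
As an illustration of the definitions in the abstract, the following minimal sketch counts novelties of order $n$ in a sequence and estimates a Heaps' exponent from them. It rests on assumptions not stated above: "appearing together" is taken to mean appearing as $n$ consecutive elements of the sequence, and the exponent is estimated by an ordinary least-squares fit of $\log D_n(t)$ against $\log t$, which may differ from the fitting procedure used in the paper. The function names `novelty_counts` and `heaps_exponent` are hypothetical.

```python
import math


def novelty_counts(sequence, n):
    """Return D_n(t) as a list: the number of distinct n-tuples of
    consecutive elements seen after each step.

    Assumption: a novelty of order n is the first appearance of an
    n-tuple of consecutive elements (n = 1 gives standard novelties).
    """
    seen = set()
    counts = []
    for i in range(len(sequence) - n + 1):
        seen.add(tuple(sequence[i:i + n]))
        counts.append(len(seen))
    return counts


def heaps_exponent(counts):
    """Estimate beta in D(t) ~ t^beta as the least-squares slope of
    log D(t) versus log t, with t = 1, 2, ..., len(counts).

    Assumption: a plain log-log regression; needs at least two points.
    """
    xs = [math.log(t) for t in range(1, len(counts) + 1)]
    ys = [math.log(d) for d in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    seq = list("abcabcabd")
    print(novelty_counts(seq, 1))  # first-order novelties over time
    print(novelty_counts(seq, 2))  # second-order novelties over time
```

In the toy sequence above, the first-order count stops growing once `a`, `b`, and `c` have been seen, while the final pair `(b, d)` still registers as a second-order novelty, which is the kind of difference between orders that the higher-order exponents are meant to capture.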

Paper Structure (7 sections, 48 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Higher-order Heaps' exponents in real-world data sets. (a-c) Average number $D_n(t)$ of novelties of order $n$, with $n=1,\,2,\,3$, as a function of the sequence length $t$, and fit of the associated Heaps' laws (dashed lines), with estimated exponents shown in the legend. The shaded area represents one standard deviation above and below the average. (d-i) Scatter plots between the ($1$st-order) Heaps' exponents $\beta_1$ and the $n$th-order exponents $\beta_n$, with $n = 2$ (d-f) and $3$ (g-i). Each point refers to a different sequence, with colors representing the density of points (see color bar). Each panel also reports histograms of the exponent distributions, the bisector $y=x$ (dashed gray line), as well as the fitted linear model (dotted red line) with the value of its coefficient of determination $R^2$. Each column refers to a different data set: (a,d,g) Last.fm, (b,e,h) Project Gutenberg and (c,f,i) Semantic Scholar, respectively.
  • Figure 2: Higher-order Heaps' exponents in existing models. Scatter plots of the (1st-order) Heaps' exponent $\beta_1$ against the $2$nd-order exponent $\beta_2$ in: (a) the urn model with triggering (UMT), no semantic correlations ($\eta=1$), and $\rho=20$, $\nu = 1,\,2, \dots,\, 20$; (b) the urn model with semantic triggering (UMST) with $\eta=0.1$ and $\rho=4$, $\nu = 1,\,2, \dots,\, 20$; (c) the edge-reinforced random walk (ERRW) on a small-world network (average degree $\langle k \rangle = 4$ and rewiring probability $p = 0.1$ [newman1999scaling]) with edge reinforcement $\rho$ ranging geometrically from $0.1$ to $10$. Each point refers to a different simulation of the related model, with colors representing the value of the free parameter (see color bar). Each panel also reports histograms of exponent distributions on the respective axes, and the bisector $y=x$ (dashed gray line). All simulations have run for $10^5$ time steps.
  • Figure 3: The Edge-Reinforced Random Walk with Triggering (ERRWT) model. An exploration process is modelled as a random walk on a growing weighted network. (a) At time $t$, the walker is at the red node $i$. Nodes that have been already visited by the walker are colored in black, in white those left to be visited. Similarly, traversed (old) and not-traversed (new) links are respectively depicted with continuous and dashed lines, whose widths represent their weights. At time $t+1$, the walker can move to each of the neighbours of $i$, e.g. nodes $j$, $k$, or $l$, with a probability proportional to the weight of the respective link. (b) If the walker moves to $j$, the weight of the link $(i,\,j)$ is reinforced by $\rho$ (Edge Reinforcement mechanism), but no new nodes or links are added to the network, since the link $(i,\,j)$ is old; (c) if the walker moves to node $k$, since link $(i,\,k)$ is new but node $k$ is old, in addition to the edge reinforcement, $\nu_2 + 1 = 2$ new edges (in green) between $k$ and old nodes are added to the network (Edge Triggering mechanism); (d) finally, if the walker moves to $l$, since both the link $(i,\,l)$ and the node $l$ are new, in addition to the edge reinforcement and the edge triggering, $\nu_1+1 = 3$ new nodes (in green) are added to the network and connected to $l$ (Node and Edge Triggering mechanism).
  • Figure 4: Higher-order Heaps' exponents in the ERRWT model. (a) Average number $D_n(t)$ of novelties of order $n$, with $n=1$ and $2$, as a function of the sequence length $t$ for simulations of the ERRWT model with parameters $\rho = 10$, $\nu_1 = 10$, $\nu_2 = 15$, and fit of the associated Heaps' laws (dashed lines), with estimated exponents shown in the legend. Shaded areas represent one standard deviation above and below the average. (b) Scatter plot between the (standard) Heaps' exponent $\beta_1$ and the $2$nd-order exponent $\beta_2$. Each point refers to a different simulation of the model, with colors representing the corresponding value of the parameter $\nu_1$ ranging from 0 to 20 (see color bar), while $\rho = 10$ and $\nu_2 = 0,\,\dots,\,2\nu_1$. (c) Variation of the average $n$th-order Heaps' exponents $\beta_n$, with $n=1,\,2$. Each curve refers to a different value of $\nu_1$, increasing from 1 to 20 from bottom left to top right, while the color represents the value of $\nu_2$ (see color bar). The set of parameters used in (a) is highlighted here with a red dot.
  • Figure 5: Fitting the ERRWT model to real-world data sets. (a) Distribution of the average distance between the pair of exponents $(\beta_1,\, \beta_2)$ of a real sequence and the pair $(\beta_1',\, \beta_2')$ obtained by the best-fitting ERRWT model. (b-d) Scatter plots of the best-fitted parameters $\nu_1$ and $\nu_2$ of the model across the sequences of the three data sets, respectively Last.fm (b), Project Gutenberg (c), and Semantic Scholar (d). The color of a point refers to the number of sequences with that pair of parameters in the best-fitting ERRWT model (see color bar).
  • ...and 11 more figures
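
The three mechanisms described in the caption of Figure 3 (edge reinforcement, edge triggering, node triggering) can be sketched as a short simulation. This is only an illustrative reading of the caption, not the authors' implementation. Several details are assumptions, because the caption does not specify them: the initial network (a visited root node with $\nu_1 + 1$ unvisited neighbours), the initial weight of a new link (1), and the rule for choosing which old nodes receive triggered edges (uniformly at random among visited nodes not yet linked to the destination). The function name `simulate_errwt` is hypothetical.

```python
import random


def simulate_errwt(steps, rho, nu1, nu2, seed=0):
    """Sketch of an edge-reinforced random walk with triggering.

    Returns the sequence of visited node ids. Initial condition,
    initial link weight, and choice of old nodes are assumptions.
    """
    rng = random.Random(seed)
    weights = {0: {}}   # adjacency map: node -> {neighbour: link weight}
    visited = {0}       # "old" nodes (already visited)
    traversed = set()   # "old" links, stored as frozenset({u, v})
    next_id = 1

    def add_link(u, v):
        # Assumption: every new link starts with weight 1.
        weights[u][v] = 1.0
        weights[v][u] = 1.0

    def add_new_nodes(u, count):
        nonlocal next_id
        for _ in range(count):
            weights[next_id] = {}
            add_link(u, next_id)
            next_id += 1

    add_new_nodes(0, nu1 + 1)  # assumed initial network
    current = 0
    path = [current]
    for _ in range(steps):
        # Move to a neighbour with probability proportional to link weight.
        nbrs = list(weights[current])
        probs = [weights[current][v] for v in nbrs]
        target = rng.choices(nbrs, weights=probs)[0]
        link = frozenset((current, target))
        link_is_new = link not in traversed
        node_is_new = target not in visited

        # Edge reinforcement: the traversed link gains weight rho.
        weights[current][target] += rho
        weights[target][current] += rho
        traversed.add(link)

        # Edge triggering: a new link adds nu2 + 1 links to old nodes.
        if link_is_new:
            candidates = [v for v in visited
                          if v != target and v not in weights[target]]
            k = min(nu2 + 1, len(candidates))
            for v in rng.sample(candidates, k):
                add_link(target, v)

        # Node triggering: a new node adds nu1 + 1 fresh neighbours.
        if node_is_new:
            add_new_nodes(target, nu1 + 1)
            visited.add(target)

        current = target
        path.append(current)
    return path


if __name__ == "__main__":
    walk = simulate_errwt(steps=1000, rho=10, nu1=10, nu2=15, seed=42)
    print(len(set(walk)), "distinct nodes visited")
```

In this sketch, first-order novelties correspond to first visits to a node and second-order novelties to first traversals of a link, so the returned path can be analysed like any other sequence of elements.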