Empirical Growing Networks vs Minimal Models: Evidence and Challenges from Software Heritage and APS Citation Datasets
Guillaume Rousseau
TL;DR
The paper tackles how to meaningfully compare empirical growing networks with minimal models despite heterogeneity, partial temporal data, and non-stationary dynamics. It introduces temporal and topological partitioning to build derived graphs (temporal graphs and $O:TSL(\delta_m)$-based graphs) that isolate growth mechanisms such as edge creation, inheritance, and aging, and compares these to a modified Barabási–Albert (Price) model with $m=2$ outgoing edges. Applying this framework to Software Heritage and the APS citation network reveals regime shifts (around 2008–2011 for SWH and around 1985 for APS) that complicate claims of a stationary scale-free regime and show the need for robust, causal modeling of transient growth regimes. The study concludes that refined tools and minimal causal growth models are essential for robust comparisons across empirical networks and for understanding how regime changes shape observed degree distributions and edge-difference statistics.
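As a rough illustration of the kind of minimal model referred to above, the following sketch grows a Price-style directed network in which each new node creates $m=2$ outgoing edges to older nodes. It is not the paper's modified model: the attachment weight (in-degree plus one), the use of node indices as timestamps, and the function names `grow_price_network`, `in_degree_histogram`, and `timestamp_difference_histogram` are all assumptions made for this example.

```python
import random
from collections import Counter


def grow_price_network(n_nodes, m=2, seed=0):
    """Grow a directed network where each new node adds m outgoing edges.

    Assumption for this sketch: a target is chosen with probability
    proportional to (in-degree + 1). The node index doubles as its timestamp.
    """
    rng = random.Random(seed)
    # Each node appears in `pool` once on arrival (the +1 offset) and once
    # more per incoming edge, so a uniform draw is preferential attachment.
    pool = list(range(m))  # m seed nodes with no edges
    edges = []
    for new in range(m, n_nodes):
        chosen = set()
        while len(chosen) < m:  # redraw until m distinct targets are found
            chosen.add(rng.choice(pool))
        for old in sorted(chosen):
            edges.append((new, old))  # edge points from newer to older node
        pool.extend(chosen)
        pool.append(new)
    return edges


def in_degree_histogram(edges, n_nodes):
    """Map each in-degree value to the number of nodes having it."""
    in_degree = Counter(dst for _, dst in edges)
    return Counter(in_degree.get(node, 0) for node in range(n_nodes))


def timestamp_difference_histogram(edges):
    """Map each timestamp difference (source minus target) to its edge count."""
    return Counter(src - dst for src, dst in edges)


if __name__ == "__main__":
    edges = grow_price_network(10_000)
    print(sorted(in_degree_histogram(edges, 10_000).items())[:5])
    print(sorted(timestamp_difference_histogram(edges).items())[:5])
```

The two histograms correspond to the observables named in the text, degree distributions and edge timestamp differences, computed here on synthetic data only.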
Abstract
We investigate the evolution rules and degree distribution properties of the Software Heritage dataset, a large-scale growing network linking software source-code versions from open-source communities. The network spans more than 40 years and includes about 6 billion nodes and edges. Our analysis relies on deterministic temporal and topological partitions of nodes and edges, which account for the multilayer and partially timestamped structure of the main graph. We derive a temporal graph that reveals a mesoscale structure and enables the study of edge dynamics (creation, inheritance, and aging), together with comparisons to minimal models based on degree distributions and histograms of edge timestamp differences. The temporal graph also exposes regime shifts that correlate with changes in developer practices, as reflected in the average number of edges per new node. We estimate scaling exponents under the scale-free hypothesis and show that the estimation method is sensitive to both regime shifts and outliers, while partitioning improves regularity and helps disentangle these effects. We extend the analysis to the APS citation network, which also exhibits a major regime shift, with an accelerated growth regime becoming dominant after 1985. Although both datasets are a priori good candidates for advanced quantitative analysis, our results illustrate how structural and dynamical transitions hamper our ability to draw firm conclusions about the existence and observability of a scale-free regime in these empirical networks. These findings underscore the need for refined tools and models to study transient growth regimes, to extend current frameworks toward minimal causal growth models, and to enable robust comparisons between empirical growing networks and minimal models.
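To make the sensitivity of exponent estimation to regime shifts concrete, here is a minimal sketch on synthetic data. It uses the standard continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$ as a stand-in; the estimator actually used in the paper is not specified in this section, and the helper names `sample_power_law` and `mle_exponent`, the chosen exponents, and the sample sizes are assumptions for illustration.

```python
import math
import random


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1], so no zero base.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def mle_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of alpha on the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need tail values strictly above x_min")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Single regime: the estimate stays close to the true exponent.
    single = sample_power_law(20_000, alpha=3.0, seed=0)
    print("single regime:", round(mle_exponent(single, 1.0), 3))

    # Two regimes pooled together: the estimate drifts with the cutoff x_min,
    # so no single exponent describes the mixture.
    mixed = sample_power_law(20_000, 2.5, seed=1) + sample_power_law(20_000, 3.5, seed=2)
    for cutoff in (1.0, 3.0, 10.0):
        print("x_min =", cutoff, "alpha_hat =", round(mle_exponent(mixed, cutoff), 3))
```

Real degree data are discrete and far less clean than these samples, so this only shows the mechanism: pooling observations from different growth regimes makes the estimated exponent depend on where the tail is cut.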
