Empirical Growing Networks vs Minimal Models: Evidence and Challenges from Software Heritage and APS Citation Datasets

Guillaume Rousseau

TL;DR

The paper tackles how to meaningfully compare empirical growing networks with minimal models despite heterogeneity, partial temporal data, and non-stationary dynamics. It introduces temporal and topological partitioning to build derived graphs (temporal graphs and $O:TSL(\delta_m)$-based graphs) that isolate growth mechanisms such as edge creation, inheritance, and aging, and compares these to a modified Barabási–Albert (Price) model with $m=2$ outgoing edges. Applying this framework to Software Heritage and the APS citation network reveals regime shifts (around 2008–2011 for SWH and around 1985 for APS) that complicate claims of a stationary scale-free regime and show the need for robust, causal modeling of transient growth regimes. The study concludes that refined tools and minimal causal growth models are essential for robust comparisons across empirical networks and for understanding how regime changes shape observed degree distributions and edge-difference statistics.

Abstract

We investigate the evolution rules and degree distribution properties of the Software Heritage dataset, a large-scale growing network linking software source-code versions from open-source communities. The network spans more than 40 years and includes about 6 billion nodes and edges. Our analysis relies on deterministic temporal and topological partitions of nodes and edges, which account for the multilayer and partially timestamped structure of the main graph. We derive a temporal graph that reveals a mesoscale structure and enables the study of edge dynamics (creation, inheritance, and aging), together with comparisons to minimal models using degree distributions and histograms of edge timestamp differences. The temporal graph also exposes regime shifts that correlate with changes in developer practices, as reflected in the average number of edges per new node. We estimate scaling exponents under the scale-free hypothesis and highlight the sensitivity of the estimation method used to both regime shifts and outliers, while showing that partitioning improves regularity and helps disentangle these effects. We extend the analysis to the APS citation network, which also exhibits a major regime shift, with an accelerated growth regime becoming dominant after 1985. Although both datasets are a priori good candidates for advanced quantitative analysis, our results illustrate how structural and dynamical transitions hamper our ability to draw firm conclusions about the existence and observability of a scale-free regime in these empirical networks. These findings underscore the need for refined tools and models to study transient growth regimes, to extend current frameworks toward minimal causal growth models, and to enable robust comparisons between empirical growing networks and minimal models.
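To make the comparison with a minimal model concrete, the following is a minimal sketch of a Price-style growth process with $m=2$ outgoing edges per new node, together with the two statistics named in the abstract: the in-degree distribution and the histogram of edge timestamp differences. It is an illustration under stated assumptions, not the paper's model: the attachment weight (in-degree + 1), the use of the node index as creation time, and all function names are hypothetical choices, and the specific modifications made in the paper are not reproduced here.

```python
import random
from collections import Counter


def grow_price_graph(n_nodes, m=2, seed=0):
    """Grow a directed graph in which each new node links to m distinct older nodes.

    Assumption: targets are drawn with probability proportional to
    (in-degree + 1), one common form of Price-style preferential attachment.
    The node index doubles as the node's creation time.
    """
    rng = random.Random(seed)
    edges = []  # list of (source, target) with source created after target
    # Each node appears in the pool once, plus once per incoming edge, so a
    # uniform draw from the pool is proportional to in-degree + 1.
    pool = list(range(m))  # m seed nodes without edges
    for new in range(m, n_nodes):
        targets = set()
        while len(targets) < m:  # reject repeated targets
            targets.add(rng.choice(pool))
        for target in sorted(targets):
            edges.append((new, target))
            pool.append(target)
        pool.append(new)  # the new node becomes eligible afterwards
    return edges


def in_degree_histogram(edges, n_nodes):
    """Map each in-degree value to the number of nodes having it."""
    in_degree = Counter(target for _, target in edges)
    return Counter(in_degree.get(node, 0) for node in range(n_nodes))


def timestamp_difference_histogram(edges):
    """Map each creation-time gap (source minus target) to its edge count."""
    return Counter(source - target for source, target in edges)


if __name__ == "__main__":
    edges = grow_price_graph(10_000, m=2, seed=1)
    degrees = in_degree_histogram(edges, 10_000)
    gaps = timestamp_difference_histogram(edges)
    print("largest in-degree:", max(degrees))
    print("largest timestamp difference:", max(gaps))
```

Because the growth rule here is stationary, both histograms come from a single regime; an empirical graph with a regime shift would mix several such regimes, which is the difficulty the abstract points to.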

Paper Structure (20 sections, 10 figures, 1 table, 6 algorithms)


Figures (10)

  • Figure 1: Graph representation of the SWH dataset studied here (the main graph), where nodes represent software versions (releases/revisions) and artifacts produced by projects across various origins/forges. Developers can act as authors and/or committers within these projects. Release and revision nodes include native temporal attributes linked to committer or author dates. Edge directions follow multilayer rules and may depend on the nodes' intrinsic identifiers. The lower layers associated with $RV$ and $RL$ nodes form a directed acyclic graph (DAG).
  • Figure 2: Overview of the graph processing pipeline used in this work. Starting from the SWH main graph, we extract several subgraphs (blue frames) corresponding to nodes with native temporal attributes and define a derived temporal graph (red frames) by partitioning $RV$ and $RL$ nodes according to existing paths and origin sizes, and by propagating their temporal information to the corresponding origin nodes. Two parameters allow us to build variants of the temporal graph using different inheritance and edge-orientation rules. The resulting temporal graph is then transformed into a TSL graph through topological partitioning.
  • Figure 3: New nodes (top) and edges (bottom) per month by type ($RV$: revision, $RL$: release) from 1970 to 2030 in the main graph of the SWH dataset (exported March 2021, dashed line). Exponential growth is observed, except for $RL$ nodes and the associated $RL{\to}RL$ and $RL{\to}RV$ edges, which exhibit a constant rate since early 2014 (third dotted line). The appearance of $RL{\to}RL$ edges aligns with the adoption of git and the launch of github.com in 2008 (first dotted line). Plain vertical lines indicate January 1st of each year from 2017 to 2021. Anomalies at the end of 2017 and 15 months before export suggest biases due to SWH crawling policies. Post-export nodes highlight temporal data issues (see Supplemental Material).
  • Figure 4: (Top) Number of new $RV$ nodes and $RV{\to}RV$ edges per month, distinguishing nodes with outgoing edges ($\delta_{out}>0$) and without ($\delta_{out}=0$). (Bottom) Rate comparison of new edges per new $RV$ node when considering all nodes (orange) and when restricting to nodes with $\delta_{out}>0$ (blue). This partitioning reveals exponential growth from the mid-2000s to 2013, followed by a constant rate after 2014. In the bottom panel of Fig. \ref{fig:edgesnodes}, this rate matches that of new $RL{\to}RV$ edges (orange dots) but not that of $RV{\to}RV$ edges (blue dots). The post-2014 decrease in the $RV{\to}RV/RV$ rate reflects the faster growth of $RV$ nodes without outgoing edges ($\delta_{out}=0$) compared to those with at least one outgoing edge.
  • Figure 5: Adjacency diagram of the TSL graph. This representation shows the weights of the different $TSL$-type origin nodes ($\delta_m = 1$). Self-loops are included in the edge-weight normalization, which explains why the sum is smaller than 100%. Percentages in parentheses correspond to the ratio of $RV$ and $RL$ nodes assigned after partitioning by $TSL$ type. Origin nodes of type $111$ and $101$ account for only a small fraction of all origin nodes (1% and 5%, respectively), despite playing a central role in the network's growth. In contrast, nodes of type $001$, which represent approximately 40% of all origin nodes and 48% of $RV/RL$ nodes, act primarily as reservoir nodes for $101$ and $111$ nodes, which together account for about 6% of origin nodes and 46% of $RV/RL$ nodes.
  • ...and 5 more figures
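
The rate comparison described in the Figure 4 caption (new edges per new node, computed over all nodes and over nodes with $\delta_{out}>0$ only) can be sketched as follows. This is an illustrative sketch rather than the paper's pipeline: the input layout (a mapping from node to creation month and a list of directed edges), the attribution of each edge to the month of its source node, and the function name `monthly_edge_rates` are all assumptions.

```python
from collections import Counter, defaultdict


def monthly_edge_rates(node_month, edges):
    """Return {month: (rate_all, rate_with_out)} for new edges per new node.

    node_month: dict mapping a node id to its creation month, e.g. "2010-01".
    edges: iterable of (source, target); an edge is counted in the month of
    its source node (assumption). Edges whose source is not in node_month
    are ignored.
    rate_all uses every new node as denominator; rate_with_out uses only the
    new nodes with at least one outgoing edge (None if there are none).
    """
    out_degree = Counter(source for source, _ in edges)
    stats = defaultdict(lambda: {"nodes": 0, "nodes_with_out": 0, "edges": 0})
    for node, month in node_month.items():
        row = stats[month]
        degree = out_degree.get(node, 0)
        row["nodes"] += 1
        row["edges"] += degree
        if degree > 0:
            row["nodes_with_out"] += 1
    rates = {}
    for month in sorted(stats):
        row = stats[month]
        rate_all = row["edges"] / row["nodes"]
        rate_with_out = (
            row["edges"] / row["nodes_with_out"] if row["nodes_with_out"] else None
        )
        rates[month] = (rate_all, rate_with_out)
    return rates


if __name__ == "__main__":
    # Toy data: four nodes over two months, three edges.
    node_month = {"a": "2010-01", "b": "2010-01", "c": "2010-02", "d": "2010-02"}
    edges = [("b", "a"), ("c", "a"), ("c", "b")]
    for month, pair in monthly_edge_rates(node_month, edges).items():
        print(month, pair)
```

The two rates diverge whenever the share of nodes without outgoing edges changes over time, which is the effect the caption attributes to the post-2014 decrease.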