Understanding Main Path Analysis

H. C. W. Price, T. S. Evans

TL;DR

This work addresses the lack of theoretical grounding in Main Path Analysis by establishing an information-theoretic and geometric basis for edge weights and path selection. It introduces a basket-based framework using generalised criticality to capture near-optimal and diverse core nodes, demonstrating robustness and scalability across artificial DAGs and real-world networks. The study shows that traditional SPC/SPE single-path methods offer little advantage over simpler unit-weight approaches, and that baskets effectively summarize the backbone of knowledge flows. Overall, the paper provides a practical, interpretable methodology for identifying key knowledge structures in large DAGs, with broad implications for bibliometrics and network science.

Abstract

Main path analysis has long been used to trace knowledge trajectories in citation networks, yet it lacks solid theoretical foundations. To understand when and why this approach succeeds, we analyse directed acyclic graphs created from two types of artificial models and by looking at over twenty networks derived from real data. We show that entropy-based variants of main path analysis optimise geometric distance measures, providing its first information-theoretic and geometric basis. Numerical results demonstrate that existing algorithms converge on near-geodesic solutions. We also show that an approach based on longest paths produces similar results, is equally well motivated yet is much simpler to implement. However, the traditional single-path focus is unnecessarily restrictive, as many near-optimal paths highlight different key nodes. We introduce an approach using "baskets" of nodes where we select a fraction of nodes with the smallest values of a measure we call "generalised criticality". Analysis of large vaccine citation networks shows that these baskets achieve comprehensive algorithmic coverage, offering a robust, simple, and computationally efficient way to identify core knowledge structures. In practice, we find that those nodes with zero unit criticality capture the information in main paths in almost all cases and capture a wider range of key nodes without unnecessarily increasing the number of nodes considered. We find no advantage in using the traditional main path methods.

Paper Structure

This paper contains 62 sections, 34 equations, 26 figures, 18 tables.

Figures (26)

  • Figure 1: Diagram illustrating the traversal count edge weight used in main path analysis on an interval DAG. The solid red square is the only source node $S$, while the solid green hexagon indicates the only sink node $T$. Each distinct path from $S$ to $T$ is counted once in SPC main path analysis. We illustrate the traversal count of the edge $(D,E)$, shown in purple with a double-headed arrow, which is $G^{\mathrm{(spc)}}_{ED}=4$. This comes from two parts. There are just two paths flowing to $D$ from initial vertices, the paths $(S,A,D)$ and $(S,B,D)$, shown in red with double-headed arrows. These paths are counted by the $W_v$ node values shown, so $W_D=2$. In a similar way, we count paths from node $E$ to the only sink node $T$. In this case there are two such paths, $(E,H,J,T)$ and $(E,G,J,T)$, shown with green double-headed arrows, giving $X_E=2$. These paths to the sink node $T$ are counted by the $X_v$ values shown by nodes. This then gives the traversal count of $(D,E)$ to be $G^{\mathrm{(spc)}}_{ED}=W_D X_E=4$. These two SPC main paths are also longest paths (i.e. by unit weight), but there are two additional longest paths found by swapping node $B$ for $A$ in the two SPC main paths.
  • Figure 2: An example of an SPC main path on an interval DAG. The solid red square indicates the source node $S$, and all paths in the main path analysis start from this node. The solid green hexagon is the sink node $T$. The labels on an edge from node $u$ to node $v$ show the $W_u \times X_v$ values, where we count every path from the initial nodes to the final node with equal weight. The products of these numbers give the edge weight for each edge, the traversal counts $G^{\mathrm{(spc)}}_{uv}$. The blue edges with double arrow heads indicate the path $(S,A,D,E,H,J,T)$, which is one of two possible SPC main paths in this example (we can swap nodes $G$ and $H$ in this main path to find the other main path).
  • Figure 3: Illustration of a hypercuboid lattice interval $\mathcal{D}^\mathrm{(int)}(s,t)$ in two dimensions, $D=2$, i.e. on a rectangular section of a square lattice $(L_1,L_2) = (3,2)$ with a single source node $s=(0,0)$ and a single target node $t=(3,2)$. Edges are between nearest neighbours and are directed either right or up. All the nodes in this interval DAG $\mathcal{D}^\mathrm{(int)}(s,t)$ lie on a path running from $s$ to $t$. All paths between any two nodes have the same network length, i.e. they contain the same number of edges. Here all paths from the source to the target have length $\sum_{i=1,2} (t_i-s_i) = 5$, measured in terms of number of edges or in terms of the Euclidean length of each edge. The path shown in thick black arrows is the sequence of nodes $(s,a,b,c,d,t)$, which can be represented as a sequence of directions, here $(R,R,U,R,U)$ with $R$ for right and $U$ for up. The diagonal shown as a red dashed arrow is the shortest path, the geodesic $\Gamma(s,t)$, between the source and target nodes as measured by the Euclidean distance in the continuous space $\mathbb{I}(s,t)$. Note that there is an order-reversal symmetry in which the direction of the arrows is reversed (equivalent here to a rotation by $180^\circ$). This symmetry links node $a$ to node $\bar{a}$ etc. It also links the path shown to a second path $(s,\bar{d},\bar{c},\bar{b},\bar{a},t)$, which is created using a sequence of moves that is the reverse of the original path, i.e. $(U,R,U,R,R)$.
  • Figure 4: Diagram illustrating the traversal count edge weight used in main path analysis on a square lattice. The paths all run from a single source $s$ to a single target node $t$, so this is an example of an SPC main path analysis. On the left, we show the number of paths $W^\mathrm{(spc)}_u$ of (\ref{e:Wspcdef}) reaching a vertex from $s$, showing how these values are a section of Pascal's triangle. On the right, we show the SPC weight $G^{\mathrm{(spc)}}_{vu}$ (\ref{e:Gspcdef}) of each edge $(u,v)$. The order-reversal symmetry of this directed lattice is evident from the fact that the edge labels in the right-hand network do not change under a $180^\circ$ rotation.
  • Figure 5: The perpendicular distance $\Delta (v_i)$ of (\ref{e:perdistdef}) for nodes $v_i$ in various paths from $\mathbf{s}=(0,0,0)$ to $\mathbf{t}=(15,17,19)$ on a cubic lattice. Each point represents a node $v_i$ at step-index $i$; lines aid visualisation. The SPC and SPE main paths ("longest (SPC)" and "longest SPE" respectively) and the greedy perpendicular distance path ("GPD", and "greedy (PD)") remain within one lattice spacing of the geodesic and are almost identical (points 12 and 39 show a very small difference). A random path, on the other hand, wanders a significant distance from the geodesic until it hits the boundary (here around 34 steps).
  • ...and 21 more figures
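
The traversal counts described in the captions of Figures 1 to 4 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `lattice_edges` and `spc_weights` are hypothetical, and the sketch assumes a single source and a single sink, as in the $(L_1,L_2)=(3,2)$ lattice interval of Figure 3. For each edge $(u,v)$ it multiplies the number of paths $W_u$ from the source to $u$ by the number of paths $X_v$ from $v$ to the sink.

```python
from functools import lru_cache
from math import comb


def lattice_edges(l1, l2):
    """Directed edges (u, v) of an l1 x l2 lattice interval; each edge points right or up."""
    edges = []
    for x in range(l1 + 1):
        for y in range(l2 + 1):
            if x < l1:
                edges.append(((x, y), (x + 1, y)))  # step right
            if y < l2:
                edges.append(((x, y), (x, y + 1)))  # step up
    return edges


def spc_weights(edges, source, sink):
    """Return {(u, v): W_u * X_v} for a DAG with one source and one sink (assumed)."""
    preds, succs = {}, {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def paths_from_source(node):  # W_node
        if node == source:
            return 1
        return sum(paths_from_source(p) for p in preds.get(node, []))

    @lru_cache(maxsize=None)
    def paths_to_sink(node):  # X_node
        if node == sink:
            return 1
        return sum(paths_to_sink(c) for c in succs.get(node, []))

    return {(u, v): paths_from_source(u) * paths_to_sink(v) for u, v in edges}


source, sink = (0, 0), (3, 2)
weights = spc_weights(lattice_edges(3, 2), source, sink)

# Every source-to-sink path has 5 edges, so there are comb(5, 2) = 10 paths in total.
total_paths = comb(5, 2)

# Each path uses exactly one edge leaving each "level" x + y = k,
# so the traversal counts on every level must sum to the total path count.
for level in range(5):
    level_sum = sum(g for (u, _), g in weights.items() if sum(u) == level)
    assert level_sum == total_paths

# Order-reversal symmetry: edge (u, v) and its 180-degree rotation carry the same weight.
for (u, v), g in weights.items():
    rotated = ((3 - v[0], 2 - v[1]), (3 - u[0], 2 - u[1]))
    assert weights[rotated] == g
```

The memoised recursion is chosen for brevity; on large citation networks an explicit pass over a topological ordering would avoid deep recursion.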