Finding coherent node groups in directed graphs
Iiro Kumpulainen, Nikolaj Tatti
TL;DR
The paper formulates directed graph segmentation (dgs) for directed graphs with node features: partition the nodes into an ordered sequence of coherent groups, minimizing the intra-group $L_2$ loss while penalizing forward and backward cross edges with separate weights. It develops an exact MILP, an iterative heuristic framework (dgs-partition, dgs-centroid, dgs-sort), LP-based approximation methods with provable guarantees, and exact polynomial-time algorithms for tree inputs and for the $k=2$ case. The authors prove NP-hardness and APX-hardness for general instances, derive a $k-1$ approximation via LP rounding, and obtain a $(k+1)/3$ bound in the symmetric case. Experiments on synthetic and real networks show that the methods are practical and yield interpretable partitions. Possible extensions include other loss functions, feature types, and edge-centric models.
Abstract
Grouping the nodes of a graph into clusters is a standard technique for studying networks. We study a problem where we are given a directed network and are asked to partition the graph into a sequence of coherent groups. We assume that nodes in the network have features, and we measure the group coherence by comparing these features. Furthermore, we incorporate the cross edges by penalizing the forward cross edges and backward cross edges with different weights. If the weights are set to 0, then the problem is equivalent to clustering. However, if we penalize the backward edges, the order of the discovered groups matters, and we can view our problem as a generalization of a classic segmentation problem. We consider a common iterative approach where we solve for the groups given the centroids, and then find the centroids given the groups. We show that, unlike in clustering, the first subproblem is NP-hard. However, we show that we can solve the subproblem exactly if the underlying graph is a tree or if the number of groups is 2. For the general case, we propose an approximation algorithm based on linear programming. We propose three additional heuristics: (1) optimizing each pair of groups separately while keeping the remaining groups intact, (2) computing a spanning tree and then optimizing using only the edges in that tree, and (3) a greedy search moving nodes between the groups while optimizing the overall loss. We demonstrate with our experiments that the algorithms are practical and yield interpretable results.
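To make the objective described in the abstract concrete, the following is a minimal Python sketch of how such a cost could be evaluated for a given ordered partition. It is an illustration under stated assumptions, not the authors' implementation: the function name `dgs_cost`, its parameters, and the convention that an edge from an earlier group to a later group counts as "forward" are all hypothetical choices made here, and the paper's exact objective may differ in detail.

```python
def dgs_cost(features, edges, labels, k, w_forward, w_backward):
    """Cost of an ordered partition (hypothetical helper, not from the paper).

    features: list of feature vectors, one per node
    edges: list of directed edges (u, v) over node indices
    labels: labels[v] is the group index of node v, in 0..k-1;
            groups are ordered by their index
    """
    cost = 0.0

    # Coherence term: squared L2 distance of each node to its group centroid.
    for g in range(k):
        members = [features[v] for v in range(len(features)) if labels[v] == g]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(x[j] for x in members) / len(members) for j in range(dim)]
        cost += sum((x[j] - centroid[j]) ** 2 for x in members for j in range(dim))

    # Cross-edge term (assumed convention): an edge is "forward" if it goes
    # from an earlier group to a later one, and "backward" otherwise.
    for u, v in edges:
        if labels[u] < labels[v]:
            cost += w_forward
        elif labels[u] > labels[v]:
            cost += w_backward

    return cost


features = [[0.0], [2.0], [10.0], [12.0]]
edges = [(0, 1), (1, 2), (0, 3)]

# Weights 0: only the clustering term remains.
print(dgs_cost(features, edges, [0, 0, 1, 1], 2, 0.0, 0.0))
# Same groups, opposite order: the cost differs once backward edges cost more.
print(dgs_cost(features, edges, [0, 0, 1, 1], 2, 1.0, 5.0))
print(dgs_cost(features, edges, [1, 1, 0, 0], 2, 1.0, 5.0))
```

With both weights set to 0 the cost reduces to the within-group squared error, which matches the abstract's remark that the problem then coincides with clustering. With a larger backward weight, swapping the order of the two groups changes the cost even though the groups themselves are identical.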
