Self-similar community structure in organisations

R. Guimera; L. Danon; A. Diaz-Guilera; F. Giralt; A. Arenas

TL;DR

The results reveal that the network self-organises into a state where the distribution of community sizes is self-similar. This suggests that a universal mechanism, responsible for the emergence of scaling in other self-organised complex systems such as river networks, could also be the underlying driving force in the formation and evolution of social networks.

Abstract

The formal chart of an organisation is designed to handle routine and easily anticipated problems, but unexpected situations arise which require the formation of new ties so that the corresponding extra tasks can be properly accomplished. The characterisation of the structure of such informal networks behind the formal chart is a key element for successful management. We analyse the complex e-mail network of a real organisation with about 1,700 employees and determine its community structure. Our results reveal the emergence of self-similar properties that suggest that some universal mechanism could be the underlying driving force in the formation and evolution of informal networks in organisations, as happens in other self-organised complex systems.

Paper Structure (1 equation, 4 figures)

Figures (4)

Figure 1: The e-mail network of URV. The network comprises approximately 1700 users, including faculty, researchers, technicians, managers, administrators, and graduate students. We consider e-mails exchanged between university addresses during the first three months of 2002. Each individual is represented by a node, with two individuals (A and B) being connected if A has sent an e-mail to B and B has also sent an e-mail to A. Bulk e-mails provide little or no information about how individuals or teams collaborate. To minimise their effect, (i) we eliminate e-mails that are sent to more than 50 different recipients and (ii) we disregard links that are unidirectional; that is, we consider only e-mails that represent a real communication link, where e-mails flow in both directions. With these two restrictions, the network is undirected and is formed by a main component comprising 1133 nodes and many isolated nodes or pairs of nodes. These small islands are not plotted, to keep the figure as simple as possible. The colour of each node identifies an individual's affiliation to a specific centre within the university.
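The two filtering rules in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input format (a list of `(sender, recipients)` pairs) and the names `build_network` and `main_component` are assumptions made for the example.

```python
from collections import defaultdict, deque

def build_network(emails, max_recipients=50):
    """Return an undirected graph as a dict: node -> set of neighbours.

    emails: iterable of (sender, recipients) pairs (a hypothetical log format).
    """
    sent = set()
    for sender, recipients in emails:
        targets = set(recipients) - {sender}
        if len(targets) > max_recipients:   # rule (i): drop bulk e-mails
            continue
        sent.update((sender, r) for r in targets)

    graph = defaultdict(set)
    for a, b in sent:
        if (b, a) in sent:                  # rule (ii): keep reciprocated links only
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def main_component(graph):
    """Largest connected component, found by breadth-first search."""
    best, seen = set(), set()
    for start in graph:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u] - part:
                part.add(v)
                queue.append(v)
        seen |= part
        if len(part) > len(best):
            best = part
    return best

# Toy log: "dean" only ever writes in bulk, and "cal" never replies.
emails = [
    ("ana", ["ben"]),
    ("ben", ["ana"]),
    ("ana", ["cal"]),
    ("dean", ["ana"] + [f"user{i}" for i in range(60)]),
    ("ana", ["dean"]),
]
graph = build_network(emails)
core = main_component(graph)
```

In this sketch, users without any reciprocated link never enter `graph`, so the isolated nodes mentioned in the caption are simply absent rather than stored as empty entries.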
Figure 2: Community identification according to the GN algorithm. a, The betweenness of an edge is defined as the number of minimum paths connecting pairs of nodes that go through that edge [wasserman94, newman01]. The GN algorithm is based on the idea that the edges which connect highly clustered communities have a higher edge betweenness (in this case, edge $BE$), and therefore cutting these edges should separate communities. The algorithm proceeds by identifying and removing the link with the highest betweenness in the network. After every removal, the betweenness of the edges is recalculated. This process is repeated until the 'parent' network splits, producing two separate 'offspring' networks. The offspring can be split further in the same way until they comprise only one individual. b, In order to describe the entire splitting process, we generate a binary tree, in which bifurcations (white nodes) depict communities and leaves (black nodes) represent individual addresses of the e-mail network. At the beginning of the process, the network is a single entity, represented by node 1 in the tree. After the removal of the edge $BE$, the network is split into two subnetworks, 2 and 3, containing addresses $A$ to $D$ and $E$ to $I$ respectively. The two offspring networks have no further internal community structure. Consider first subnetwork 2, containing nodes $A$ to $D$. When all the links are equivalent and have the same betweenness, as in the present case, one of them will be selected at random for removal. It is straightforward to show that, iterating the link removal procedure, nodes will be separated one by one and randomly by the GN algorithm, generating a branch in the binary tree. As an example, the figure represents a situation in which $B$ is separated first, then $A$, and finally $D$ and $C$, but a different random selection of links would lead to a different separation order.
Similarly, in subnetwork 3 nodes will be separated one by one and at random, except for the fact that the most central node, $E$, will always be separated last. In general, for large networks in which the probability of having two links with the same betweenness is very small, it will still be true that communities will appear as branches in the community binary tree and that the tips of the branches will correspond to the most central agents in the network.
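The splitting procedure described in the Figure 2 caption can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code. Edge betweenness is computed with a Brandes-style accumulation that shares credit equally among tied shortest paths. Ties between edges are broken by taking the first maximum rather than at random. The example graph (a four-node clique joined to a five-node star through edge $BE$) is a hypothetical stand-in, since the caption does not list the links of the figure.

```python
from collections import deque

def edge_betweenness(graph):
    """Edge betweenness for an unweighted, undirected graph (node -> set of neighbours)."""
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        dist, paths = {source: 0}, {source: 1.0}
        preds = {v: [] for v in graph}
        order, queue = [], deque([source])
        while queue:                          # BFS counting shortest paths
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        credit = {v: 0.0 for v in order}
        for v in reversed(order):             # push path credit back towards the source
            for u in preds[v]:
                share = paths[u] / paths[v] * (1.0 + credit[v])
                score[frozenset((u, v))] += share
                credit[u] += share
    return {e: s / 2.0 for e, s in score.items()}  # each pair was counted from both ends

def components(graph):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in graph:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u] - part:
                part.add(v)
                queue.append(v)
        seen |= part
        parts.append(part)
    return parts

def split(nodes, graph):
    """Remove highest-betweenness edges until the connected set `nodes` splits in two."""
    sub = {u: graph[u] & nodes for u in nodes}     # private copy of the subgraph
    while True:
        parts = components(sub)
        if len(parts) > 1:
            return parts
        scores = edge_betweenness(sub)             # recalculated after every removal
        u, v = max(scores, key=scores.get)
        sub[u].discard(v)
        sub[v].discard(u)

def gn_tree(nodes, graph):
    """Binary tree as nested tuples: a leaf is a node label, a community is (left, right)."""
    if len(nodes) == 1:
        return next(iter(nodes))
    left, right = split(nodes, graph)
    return (gn_tree(left, graph), gn_tree(right, graph))

# Hypothetical example: clique A-D, star centred on E, joined by the edge B-E.
links = ["AB", "AC", "AD", "BC", "BD", "CD", "BE", "EF", "EG", "EH", "EI"]
graph = {n: set() for n in "ABCDEFGHI"}
for a, b in links:
    graph[a].add(b)
    graph[b].add(a)
first_split = split(set(graph), graph)
tree = gn_tree(set(graph), graph)
```

In this toy graph the edge $BE$ carries every path between the two groups, so it is removed first and `first_split` contains the sets A to D and E to I. The order in which nodes peel off inside each group depends on how ties are broken, so the exact shape of `tree` may differ from run to run.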
Figure 3: Communities in the e-mail network of URV. a, Binary tree showing the result of applying the GN algorithm to the e-mail network of URV. The position indicated by the arrow represents the root of the tree (equivalent to node 1 in Figure 2b) and branches are depicted so that they can be clearly differentiated. In particular, only the leaves of the tree, which correspond to e-mail addresses, are plotted, as shown in the zoomed detail. The colour of each leaf represents a different centre within the university (five small centres containing fewer than 10 individuals are assigned the same colour). Nodes of the same colour (from the same centre) tend to stick together in the same branch, meaning that individuals within the same centre tend to communicate more with each other, and that the algorithm is capable of resolving separate centres to a good degree of accuracy. The complicated branching structure resembles self-similar systems in nature such as river networks or diffusion-limited aggregates. b, Same as before but without showing the leaves. Branches are now coloured according to their Horton-Strahler index (see text). c, Binary tree showing the result of applying the GN algorithm to a random graph with the same size and connectivity as the e-mail network. The lack of community structure is reflected in the absence of branches in the tree, which contrasts with the intricate self-similar structure of a and b. Again, colours correspond to Horton-Strahler indices.
Figure 4: Self-similarity in the community structure. a, Calculation of the community size distribution for a binary tree generated by the community identification algorithm. Black nodes represent the actual nodes of the original graph, while white nodes are just graphical representations of communities that arise as a result of the splitting procedure. Nodes $A$ and $B$ belong to a community of size 2, and together with $E$ form a community of size 3. Similarly, $C$, $D$ and $F$ form another community of size 3. These two groups together form a higher-level community of size 6. Following up to higher and higher levels, the community structure can be regarded as a set of nested groups. The size, $s_i$, of a community $i$ is the sum of the sizes of its two offspring $j_1$ and $j_2$: $s_i=s_{j_1} + s_{j_2}$. In this case there are three communities of size 2, three communities of size 3, one community of size 6, one community of size 7, and one community of size 10. Note that a single node belongs to different communities at different levels. b, Calculation of the drainage area distribution for a river network. The drainage area of a given point is the number of nodes upstream of it plus one. For a point $i$ with offspring $j_1$ and $j_2$, $s_i=s_{j_1} + s_{j_2} + 1$. c, Calculation of the Horton-Strahler index. The index of a branch changes when it meets a branch with a higher index, or when it meets a branch with the same index and the two join to form a branch with a higher index. In this case, there are 10 branches with index 1, 3 branches with index 2, and 1 branch with index 3. d, The distribution of community sizes, $P(s)$, showing a power-law region with exponent $-0.48$, followed by a sharp decrease at $s\approx 100$ and a cutoff corresponding to the size of the system at $s\approx 1000$. The distribution of community sizes in a random network is shown with a dotted line for comparison.
e, The number of branches with HS index $i$, $N_i$, as a function of $i$. From the definition of the branching ratio, it is straightforward to show that, when topological self-similarity holds, $N_i=N_1 / B^{i-1}$. Fitting this function to the points obtained for the e-mail community tree yields excellent agreement, with $B=5.76$. The agreement is much worse for the community tree corresponding to the random network, with $B_i$ fluctuating around 3.46.
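The three recursions in the Figure 4 caption (community size, drainage area, and Horton-Strahler index) can be illustrated on a small tree. The sketch below is an assumption-laden example rather than the authors' code. The tree is stored as nested tuples, with strings as leaves. Its shape is reconstructed to match the counts quoted in panels a and c, and the leaf labels $G$ to $J$ are invented for the example.

```python
from collections import Counter

def community_sizes(node, sizes):
    """Return the number of leaves under `node`, recording every community size."""
    if isinstance(node, str):            # a leaf is an individual, not a community
        return 1
    s = community_sizes(node[0], sizes) + community_sizes(node[1], sizes)
    sizes[s] += 1                        # s_i = s_j1 + s_j2
    return s

def drainage_area(node):
    """River-network analogue: every node, internal or leaf, contributes one unit."""
    if isinstance(node, str):
        return 1
    return drainage_area(node[0]) + drainage_area(node[1]) + 1

def strahler(node, branches):
    """Return the Horton-Strahler index of `node`, counting branches per index."""
    if isinstance(node, str):
        branches[1] += 1                 # every leaf starts a branch of index 1
        return 1
    a = strahler(node[0], branches)
    b = strahler(node[1], branches)
    if a == b:                           # two equal branches merge into a higher one
        branches[a + 1] += 1
        return a + 1
    return max(a, b)                     # otherwise the higher-index branch continues

# Hypothetical tree consistent with the counts in panels a and c.
abe = (("A", "B"), "E")
cdf = (("C", "D"), "F")
hij = (("H", "I"), "J")
tree = (((abe, cdf), "G"), hij)

sizes, branches = Counter(), Counter()
total = community_sizes(tree, sizes)
root_index = strahler(tree, branches)
print(dict(sizes))      # {2: 3, 3: 3, 6: 1, 7: 1, 10: 1} (key order may vary)
print(dict(branches))   # {1: 10, 2: 3, 3: 1} (key order may vary)
```

Assuming the usual Horton definition of the branching ratio, $B_i = N_i / N_{i+1}$, topological self-similarity means $B_i = B$ for every $i$, so $N_i = N_{i-1}/B = \dots = N_1 / B^{i-1}$, which is the form fitted in panel e. The functions are recursive for brevity; a tree as deep as the one for the full e-mail network may need an iterative traversal or a raised recursion limit.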