Community detection in bipartite signed networks is highly dependent on parameter choice

Elena Candellone, Erik-Jan van Kesteren, Sofia Chelmi, Javier Garcia-Bernardo

TL;DR

The findings reveal that when no communities are present in the data, these methods often recover spurious user communities, indicating that researchers using community detection methods in the context of bipartite signed networks should not take the communities found at face value.

Abstract

Decision-making processes often involve voting. Human interactions with exogenous entities such as legislation or products can be effectively modeled as two-mode (bipartite) signed networks, where people can vote positively, vote negatively, or abstain from voting on the entities. Detecting communities in such networks could help us understand underlying properties, for example, ideological camps or consumer preferences. While community detection is an established practice separately for bipartite and signed networks, it remains largely unexplored in the case of bipartite signed networks. In this paper, we systematically evaluate the efficacy of community detection methods on projected bipartite signed networks using a synthetic benchmark and real-world datasets. Our findings reveal that when no communities are present in the data, these methods often recover spurious user communities. When communities are present, the algorithms exhibit promising performance, although their performance is highly sensitive to parameter choice. This indicates that researchers using community detection methods in the context of bipartite signed networks should not take the communities found at face value: it is essential to assess the robustness of parameter choices or perform domain-specific external validation.

Paper Structure (29 sections, 11 equations, 9 figures)

Figures (9)

  • Figure 1: Synthetic bipartite signed networks. We combine synthetic scenarios and insights from data to generate synthetic networks. Given a synthetic scenario, i.e., ideologies of users and stories modeled by probability distributions $p_U$ and $p_S$, which can be either unimodal or bimodal (step 1), we sample users' and stories' ideologies, $x_U$ and $x_S$, from those distributions (step 2). For each user-story pair, the user then votes positively, votes negatively, or abstains, depending on the difference between the user and story ideologies (step 3); the outcome is governed by two voting thresholds, $t_+$ and $t_-$, set to match the voting probability in real datasets. Given the bipartite network, we then project it onto a unipartite network and apply the community detection methods for a wide range of parameter choices (step 4).
  • Figure 2: Community Detection on Sparse Synthetic Networks. We tested the community detection methods on the four synthetic scenarios. Panels (A, C) show the results for the scenarios with two communities of users, while panels (B, D) show the results with one community. We used the Rand Index to evaluate the performance of the algorithms with different parameter choices. Higher values of the Rand Index indicate better alignment between expected and empirical communities. Panels (A, B) show the results for community-spinglass. We experimented with different combinations of the parameters $\gamma^+, \gamma^- \in \{0.5, 1, 2\}$. Lower values of $\gamma^+$ indicate less importance given to positive ties in a community, whereas lower values of $\gamma^-$ penalize the presence of negative links within a community. Note that $\gamma^- = 0.5$ is generally the best parameter choice, as it finds the expected communities for the synthetic scenarios. Panels (C, D) show the results for SPONGE. We conducted tests for different values of the number of clusters $k$, ranging from $k=1$ (no communities) to $k=10$, and repeated each test ten times because the model is stochastic. Error bars show the standard deviation around the mean. Note that the algorithm correctly identifies the expected communities in scenarios with polarized users (panel C), while it generates spurious communities in cases where stories introduce a latent ideology to a crowd of neutral users (the U NP S P case in panel D).
  • Figure 3: Community Detection on Dense Synthetic Networks. We tested the community detection methods on the four synthetic scenarios. Panels (A, C) show the results for the scenarios with two communities of users, while panels (B, D) show the results with one community. We used the Rand Index to evaluate the performance of the algorithms with different parameter choices. Higher values of the Rand Index indicate better alignment between expected and empirical communities. Panels (A, B) show the results for community-spinglass. We experimented with different combinations of the parameters $\gamma^+, \gamma^- \in \{0.5, 1, 2\}$. Lower values of $\gamma^+$ indicate less importance given to positive ties in a community, whereas lower values of $\gamma^-$ penalize the presence of negative links within a community. Note that scenarios where users and stories are either both neutral or both polarized (i.e., U NP S NP and U P S P) are correctly identified for low values of $\gamma^-$. Panels (C, D) show the results for SPONGE. We conducted tests for different values of the number of clusters $k$, ranging from $k=1$ (no communities) to $k=10$, and repeated each test ten times because the model is stochastic. Error bars show the standard deviation around the mean. The algorithm correctly identifies the expected communities in scenarios with polarized users (panel C), while it generates spurious communities when users are not polarized (panel D).
  • Figure 4: Community Detection on US House of Representatives Networks. We tested the community detection methods on 33 co-voting networks resulting from the (dis)agreement of members of the US House of Representatives on several bills, divided per year. We considered the subdivision into political parties (i.e., Democratic, Republican, or Independents) as the "true" communities. Panel (A) shows the results for community-spinglass. We found that parameter choices with $\gamma^- < 2$ consistently capture the subdivision of Representatives into political parties. Panel (B) shows the results for SPONGE. Confidence intervals represent the standard deviation range among different runs. We conducted tests for different values of the number of clusters $k$, ranging from $k=1$ (no communities) to $k=10$. For $k=1$, we consistently find that the true subdivision is not recovered. However, for subsequent values of $k$, we observe a peak for values $k \in \{2,4\}$ in most of the networks, with a gradual decrease in the method's efficacy as $k$ increases. Note that the Rand Index increased in more recent years.
  • Figure A5: Community Detection on Menéame Synthetic Networks, degree-corrected version. We tested the community detection methods on the four synthetic scenarios, with voting probabilities sampled from sparse data's voting distributions. Panels (A, B) show the results for community-spinglass. Lower values of $\gamma^+$ indicate less importance given to positive ties in a community, whereas lower values of $\gamma^-$ penalize the presence of negative links within a community. Note that $\gamma^- = 0.5$ is generally the best parameter choice, as it finds the expected communities for the synthetic scenarios, except for the case where neither users nor stories are polarized. Panels (C, D) show the results for SPONGE. We observed that the algorithm correctly identifies the expected communities in scenarios with polarized users (panel C), while it generates spurious communities in cases where stories introduce latent ideologies (U NP S P case).
  • ...and 4 more figures
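
The pipeline in the Figure 1 caption and the Rand Index used in Figures 2 to 4 can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the Gaussian shapes of $p_U$ and $p_S$, the rule that a user votes positively when the ideological distance is at most $t_+$ and negatively when it is at least $t_-$, the co-voting projection `votes @ votes.T`, and all function names and default values. The Rand Index is computed from its standard definition, the fraction of node pairs on which two partitions agree.

```python
import numpy as np


def sample_ideologies(n, bimodal, rng, loc=1.0, scale=0.3):
    """Steps 1-2: draw n ideologies and their expected community labels.

    Assumption: a bimodal scenario is an equal mixture of two Gaussians
    centred at -loc and +loc; a unimodal one is a single Gaussian at 0.
    """
    if bimodal:
        sides = rng.choice([-1, 1], size=n)
        return sides * loc + rng.normal(0.0, scale, size=n), sides
    return rng.normal(0.0, scale, size=n), np.ones(n, dtype=int)


def cast_votes(x_users, x_stories, t_plus, t_minus):
    """Step 3: signed bipartite matrix with entries +1, -1, or 0 (abstain).

    Assumed rule (requires t_plus < t_minus): close pairs vote positively,
    distant pairs vote negatively, and pairs in between abstain.
    """
    dist = np.abs(x_users[:, None] - x_stories[None, :])
    votes = np.zeros(dist.shape, dtype=int)
    votes[dist <= t_plus] = 1
    votes[dist >= t_minus] = -1
    return votes


def project_users(votes):
    """Step 4 (assumed projection): agreements minus disagreements per user pair."""
    weights = votes @ votes.T
    np.fill_diagonal(weights, 0)  # drop self-loops
    return weights


def rand_index(labels_a, labels_b):
    """Fraction of node pairs that both partitions treat the same way."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))


# Example: polarized users and polarized stories (illustrative sizes and thresholds).
rng = np.random.default_rng(0)
x_u, expected = sample_ideologies(50, bimodal=True, rng=rng)
x_s, _ = sample_ideologies(200, bimodal=True, rng=rng)
signed_user_network = project_users(cast_votes(x_u, x_s, t_plus=0.3, t_minus=1.5))
```

A community detection method would then be run on `signed_user_network`, and its output compared with `expected` via `rand_index`. Repeating that comparison over a grid of method parameters is one way to carry out the robustness check the abstract recommends.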