
Testing network clustering algorithms with Natural Language Processing

Ixandra Achitouv, David Chavalarias, Bruno Gaume

TL;DR

A hybrid methodology evaluates the alignment between structural communities inferred from interaction networks and the linguistic coherence of users' textual production in online social networks, and introduces a coverage–precision trade-off metric to assess community-level performance.

Abstract

The advent of online social networks has led to an abundant literature on online social groups and their relationship to individuals' personalities as revealed by their textual productions. Social structures are inferred from a wide range of social interactions. Those interactions form complex -- sometimes multi-layered -- networks, on which community detection algorithms are applied to extract higher-order structures. The choice of community detection algorithm is, however, hardly questioned in relation to the cultural production of the individuals it classifies. In this work, we assume the entangled nature of social networks and their cultural production to propose a definition of culture-based online social groups as sets of individuals whose online production can be categorized as social-group-related. We take advantage of this apparently self-referential description of online social groups with a hybrid methodology that combines a community detection algorithm and a natural language processing classification algorithm. A key result of this analysis is the possibility of scoring community detection algorithms by their agreement with the natural language processing classification. A second result is that we can assign the opinion of a random user with >85% accuracy.
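The scoring idea in the abstract can be sketched minimally: a community detection algorithm (CDA) is scored by the fraction of users for whom its community label agrees with the category assigned by the NLP classifier. The helper and label names below are hypothetical, not from the paper's code.

```python
def cda_agreement(cda_labels, nlp_labels):
    """Score a CDA by its agreement with NLP categorization:
    the fraction of shared users whose community label matches
    their NLP-predicted category (hypothetical sketch)."""
    users = cda_labels.keys() & nlp_labels.keys()
    if not users:
        return 0.0
    agree = sum(cda_labels[u] == nlp_labels[u] for u in users)
    return agree / len(users)

# Toy example with made-up users: 3 of 4 labels agree.
cda = {"a": "pro", "b": "denialist", "c": "pro", "d": "pro"}
nlp = {"a": "pro", "b": "denialist", "c": "denialist", "d": "pro"}
print(cda_agreement(cda, nlp))  # 0.75
```

In this framing, different CDAs (Louvain, BEC, Infomap) can be compared simply by ranking their agreement scores on the same user set.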


Paper Structure

This paper contains 24 sections, 3 equations, and 8 figures.

Figures (8)

  • Figure 1: Visualisation of climate change related tweets from 2022-07-01 until 2022-10-30, where colors represent different communities: cold/warm colors correspond to pro-climate/denialist users respectively. In total there are 29347 accounts (nodes) and 361559 retweets (edges) among those accounts.
  • Figure 2: Histogram of user eigencentrality in our network. The vertical line marks the $.75$-quantile, which is the cut between anchors and tested users.
  • Figure 3: CDA accuracy based on its agreement with the NLPCA. $L_c$ corresponds to Louvain with parameter c, $B_s$ corresponds to BEC with parameter s, and IM to Infomap. Top panel: fraction of users for whom the NLPCA agrees with the CDA, regardless of the number of tweets. The error bars are 1-sigma deviations computed by Jackknife resampling. The vertical dotted line corresponds to the mean accuracy over all CDAs we consider. Lower panel: fraction of users for whom the NLPCA agrees with the CDA as a function of the number of tweets a user made in the testing set.
  • Figure 4: Lower panel: number of communities found by each CDA. Top panel: Precision (percentage of agreement between the CDA and NLPCA categorization), Coverage (percentage of users covered by our 4 selected categories), and F-score (weighted score combining precision and coverage).
  • Figure 5: Correlation between the entropy measure and the number of tweets (top panel) and the classification agreement with the NLPCA (lower panel).
  • ...and 3 more figures
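The precision/coverage/F-score trade-off shown in Figure 4 can be sketched as follows. The exact weighting the paper uses is not given in this excerpt; the sketch below assumes the standard $F_\beta$ harmonic-mean form, with coverage playing the role usually played by recall.

```python
def f_score(precision, coverage, beta=1.0):
    """Weighted score combining precision (agreement between CDA
    and NLPCA) and coverage (fraction of users in the selected
    categories). Assumed F_beta harmonic-mean form; the paper's
    exact weighting may differ."""
    if precision == 0.0 and coverage == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * coverage / (b2 * precision + coverage)

# Illustrative numbers (not from the paper): high precision,
# moderate coverage yields an intermediate trade-off score.
print(round(f_score(0.90, 0.60), 2))  # 0.72
```

A CDA that fragments the network into many small communities can score high precision but low coverage; the combined score penalizes both failure modes.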