Implications of construction decisions in keyword-based networks: an empirical assessment

James Nevin, Salvatore Flavio Pileggi, Michael Lees, Paul Groth

TL;DR

The paper investigates how data preprocessing choices in keyword-based networks influence network properties, node rankings, and community structure using two empirical case studies: Twitter and Scopus. By applying automatic keyword extraction with multiple keyword-length settings and sentiment partitions, it demonstrates pervasive sensitivity across global metrics and top nodes, underscoring the need for transparent preprocessing reporting. The study's dual-domain evidence suggests that preprocessing decisions can meaningfully alter conclusions, motivating replication with varied pipelines and careful methodological documentation. The work advances methodological rigor in text-derived network analysis and has practical implications for the reliability and comparability of studies across social media and bibliometric domains.

Abstract

The large amounts of data continuously generated online offer opportunities to identify and analyse trends in various aspects of society. For instance, data from online social media are frequently used as a means of analysing informal interactions, opinions, and feelings of groups of people. Additionally, bibliometric data can be used to investigate more formal trends that occur in scientific research. A popular approach to analysing such complex semi-structured data is the construction of complex networks based on keywords or concept extraction. However, such keyword-based complex network data are often shared in a preprocessed form, with little information about the underlying process used to construct them. Indeed, key decisions are normally made at an early stage in the construction of complex networks from raw data, and can have a significant impact on subsequent analysis and interpretation. In this paper, we highlight the sensitivity of results to data preprocessing decisions by looking at two different case studies which employ networks constructed from underlying semi-structured data. The experiments conducted show high sensitivity to data preprocessing for many commonly adopted metrics. These results demonstrate the need for transparent reporting of data lineage and preprocessing decisions.

Paper Structure

This paper contains 38 sections, 9 figures, and 17 tables.

Figures (9)

  • Figure 1: General methodology for processing tweets into node count and edge list. All blocks highlighted in yellow represent preprocessing decisions. The methodology is similar for Scopus articles, excluding the sentiment steps and relevance score threshold.
  • Figure 2: Original (a) and processed (b) data object.
  • Figure 3: Example subgraph of word co-occurrence network from Twitter data, where extracted keywords form nodes and the strength of the connections between nodes (thickness of edges) is based on the frequency of co-occurrence in the same tweet.
  • Figure 4: Cumulative degree centrality distribution for word co-occurrence networks created using keywords of different lengths, divided by positive, negative, neutral, and all tweets with no relevance score threshold.
  • Figure 5: Cumulative degree centrality distribution for word co-occurrence networks created using keywords of different lengths, using all tweets with no, low, and high relevance score threshold.
  • ...and 4 more figures
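
The captions for Figures 1 and 3 describe turning extracted keywords into a node count and a weighted edge list, with edge weight given by how often two keywords co-occur in the same tweet. A minimal sketch of that step follows. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_cooccurrence`, the sample keyword lists, and the choice to count each keyword pair at most once per document are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Return (node_counts, edge_weights) for a list of keyword lists.

    Each inner list holds the keywords extracted from one document
    (for example, one tweet). Hypothetical helper, not from the paper.
    """
    node_counts = Counter()
    edge_weights = Counter()
    for keywords in docs:
        # Assumption: a keyword repeated within one document counts once.
        unique = sorted(set(keywords))
        node_counts.update(unique)
        # Sorting gives every undirected edge a single canonical (a, b) key.
        edge_weights.update(combinations(unique, 2))
    return node_counts, edge_weights


# Hand-written keyword lists standing in for automatic keyword extraction.
docs = [
    ["climate", "policy", "energy"],
    ["climate", "energy"],
    ["policy", "election"],
]
node_counts, edge_weights = build_cooccurrence(docs)
```

Even in this toy form, the deduplication rule is itself a preprocessing decision of the kind the paper examines: counting repeated keywords within a document would change both node counts and edge weights.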