Probing the statistical properties of enriched co-occurrence networks

Diego R. Amancio, Jeaneth Machicao, Laura V. C. Quispe

TL;DR

This work investigates how enriching word co-occurrence networks with semantic (virtual) edges alters statistical properties used in text analysis. It introduces two key notions: informativeness (a metric’s ability to distinguish meaningful from nonsensical text) and variability ratio (a metric’s bias toward syntactic versus semantic features), evaluated across short and long texts using global and local edge-thresholding strategies and FastText embeddings. Using two datasets—the NEN English novels and the NLANG multilingual New Testament translations—along with shuffling-based normalization, the study reveals that virtual edges can improve informativeness for metrics like the average shortest path length $L$ and closeness centrality $C$ in short texts but may reduce the informativeness of the clustering coefficient $CC$; stopword filtering further modulates these effects. Overall, the results provide guidelines on which network metrics are best suited for particular text sizes and tasks, highlighting that enrichment with semantic edges has nuanced, metric-dependent impacts on topology and linguistic feature capture.

Abstract

Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have positive and negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Methodology used in this study to analyze enriched networks. All pairwise similarities are calculated, and the similarity weights are sorted in decreasing order. To filter the edges, the global strategy selects those with the highest weights across the entire network, while the local strategy evaluates the importance of an edge based on the local structure of each node. The total number of included edges is a parameter that varies throughout the analysis.
  • Figure 2: Example of hypothetical closeness centrality values obtained from the NEN and NLANG datasets. Because the variability in the NEN dataset is greater than that in the NLANG dataset, we have $V_R < 1$. This indicates that the metric is more dependent on semantics than on syntax.
  • Figure 3: Global strategy: distribution of the informativeness and variability measures for the average shortest path (L), closeness centrality (C), clustering coefficient (CC), betweenness centrality (B), PageRank (PR), and eigenvector centrality (EV) as virtual edges are added to networks generated from texts of varying sizes, with stop-word filtering.
  • Figure 4: Global strategy: distribution of the informativeness and variability measures for the average shortest path (L*), closeness centrality (C*), clustering coefficient (CC*), betweenness centrality (B*), PageRank (PR*), and eigenvector centrality (EV*) as virtual edges are added to networks generated from texts of varying sizes, with stop-word filtering.
  • Figure S1: Local strategy: distribution of the informativeness and variability measures for the average shortest path (L), closeness centrality (C), clustering coefficient (CC), betweenness centrality (B), PageRank (PR), and eigenvector centrality (EV) as virtual edges are added to networks generated from texts of varying sizes, with stop-word filtering.
  • ...and 1 more figure
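
The global strategy described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity as the similarity measure, uses small hand-made vectors in place of FastText embeddings, and assumes that pairs already linked by co-occurrence are skipped when virtual edges are chosen. The names `cooccurrence_edges`, `global_virtual_edges`, and `toy_embeddings` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cooccurrence_edges(tokens):
    """Undirected edges linking words that appear next to each other."""
    return {frozenset((a, b)) for a, b in zip(tokens, tokens[1:]) if a != b}


def global_virtual_edges(embeddings, existing, n_edges):
    """Rank all unlinked word pairs by similarity and keep the top n_edges.

    Skipping pairs that already co-occur is an assumption of this sketch.
    """
    candidates = []
    for a, b in combinations(sorted(embeddings), 2):
        if frozenset((a, b)) not in existing:
            candidates.append((cosine(embeddings[a], embeddings[b]), a, b))
    # Sort similarity weights in decreasing order, as in the caption.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(a, b) for _, a, b in candidates[:n_edges]]


# Toy data: hand-made 2-D vectors standing in for FastText embeddings.
tokens = ["king", "rules", "land", "queen", "rules", "sea"]
toy_embeddings = {
    "king": [1.0, 0.0],
    "queen": [0.9, 0.1],
    "land": [0.0, 1.0],
    "sea": [0.1, 0.9],
    "rules": [0.5, 0.5],
}

edges = cooccurrence_edges(tokens)
virtual = global_virtual_edges(toy_embeddings, edges, n_edges=2)
print(virtual)
```

Here `n_edges` plays the role of the "total number of included edges" parameter that the analysis varies. The local strategy is not sketched, because the caption does not specify how edge importance is computed from each node's local structure.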