Probing the statistical properties of enriched co-occurrence networks
Diego R. Amancio, Jeaneth Machicao, Laura V. C. Quispe
TL;DR
This work investigates how enriching word co-occurrence networks with semantic (virtual) edges alters the statistical properties used in text analysis. It introduces two key notions: informativeness (a metric’s ability to distinguish meaningful from nonsensical text) and variability ratio (a metric’s bias toward syntactic versus semantic features). Both are evaluated on short and long texts, with virtual edges derived from FastText embeddings under global and local edge-thresholding strategies. Using two datasets, the NEN English novels and the NLANG multilingual New Testament translations, along with shuffling-based normalization, the study finds that virtual edges can improve informativeness for metrics such as the average shortest path length $L$ and closeness centrality $C$ in short texts but may reduce the informativeness of the clustering coefficient $CC$. Stopword filtering further modulates these effects. Overall, the results provide guidelines on which network metrics are best suited to particular text sizes and tasks, and show that enrichment with semantic edges has nuanced, metric-dependent effects on topology and on the linguistic features a metric captures.
Abstract
Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have either positive or negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path length and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.
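To make the notion of enrichment concrete, the following minimal sketch builds a co-occurrence network from a toy sentence, adds virtual edges under a single global similarity threshold, and compares the average shortest path length $L$ before and after enrichment. It is an illustration under stated assumptions, not the procedure used in the study: the adjacency window of one word, the two-dimensional toy vectors (standing in for FastText embeddings), the threshold of 0.9, and all function names are hypothetical choices made for this example.

```python
import math
from collections import deque
from itertools import combinations


def cooccurrence_edges(tokens):
    """Link each pair of adjacent, distinct words (assumed window of one word)."""
    return {frozenset((a, b)) for a, b in zip(tokens, tokens[1:]) if a != b}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def virtual_edges(vectors, threshold):
    """Global thresholding: one fixed similarity cutoff for every word pair."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    }


def adjacency(edges):
    """Convert an undirected edge set into an adjacency map."""
    adj = {}
    for edge in edges:
        a, b = tuple(edge)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def average_shortest_path(adj):
    """Mean breadth-first distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy text and hypothetical 2-d vectors (placeholders for real embeddings).
tokens = "cat eats fish and dog eats meat".split()
toy_vectors = {
    "cat": (1.0, 0.1),
    "dog": (0.9, 0.2),
    "fish": (0.1, 1.0),
    "meat": (0.2, 0.9),
}

plain = cooccurrence_edges(tokens)
virtual = virtual_edges(toy_vectors, threshold=0.9)  # arbitrary cutoff
enriched = plain | virtual

plain_L = average_shortest_path(adjacency(plain))
enriched_L = average_shortest_path(adjacency(enriched))
print(f"L without virtual edges: {plain_L:.3f}")
print(f"L with virtual edges:    {enriched_L:.3f}")
```

Because virtual edges are added only between words already present in the network, every pairwise distance can only stay the same or shrink, so $L$ cannot increase in this connected toy example. Whether such a change makes a metric more or less informative is the empirical question the study addresses; answering it would additionally require comparing each value against the same metric computed on shuffled versions of the text.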
