Topology of Syntax Networks across Languages
Juan Soria-Postigo, Luis F Seoane
TL;DR
This work investigates whether syntactic dependencies across languages converge to a universal topological scaffold by constructing undirected syntax graphs from Universal Dependencies corpora for 50 languages and focusing on the 500 most frequent tokens. It introduces a morphospace-based node analysis to identify Topological Communities (TCs) via PCA, enabling a per-word functional taxonomy within each network. The results reveal a largely universal core-periphery backbone and connector structures across languages, while standard global-topology measures offer limited phylogenetic insight and outliers challenge evolutionary classifications. The Spanish inflected network case study demonstrates how TC analysis captures role-based organization and cross-language patterns, with implications for comparative linguistics and potential neurolinguistic interpretation of syntax networks.
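The graph construction summarized above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (each sentence as a list of `(token, head_index)` pairs, with a 1-based head index and 0 for the root), the lowercasing of tokens, and the function name `build_syntax_graph` are all assumptions made for the example. Real Universal Dependencies corpora are distributed as CoNLL-U files and would need to be parsed into this shape first.

```python
from collections import Counter


def build_syntax_graph(sentences, top_n=500):
    """Build an undirected syntax graph restricted to the top_n most frequent tokens.

    sentences: list of sentences, each a list of (token, head_index) pairs.
    head_index is 1-based within the sentence; 0 marks the root (assumed format).
    Returns a dict mapping each kept token to the set of its neighbours.
    """
    # Token frequencies over the whole corpus (lowercasing is an assumption).
    counts = Counter(tok.lower() for sent in sentences for tok, _ in sent)
    vocab = {tok for tok, _ in counts.most_common(top_n)}

    adjacency = {tok: set() for tok in vocab}
    for sent in sentences:
        for tok, head in sent:
            if head == 0:
                continue  # the root has no governor
            dependent = tok.lower()
            governor = sent[head - 1][0].lower()
            # Keep the edge only if both ends are frequent tokens; skip self-loops.
            if dependent in vocab and governor in vocab and dependent != governor:
                adjacency[dependent].add(governor)
                adjacency[governor].add(dependent)
    return adjacency


# Toy corpus: three hand-annotated sentences (hypothetical data).
toy = [
    [("the", 2), ("dog", 3), ("barks", 0)],
    [("the", 2), ("cat", 3), ("sleeps", 0)],
    [("a", 2), ("dog", 3), ("sleeps", 0)],
]
graph = build_syntax_graph(toy, top_n=3)
```

With `top_n=3` the toy graph keeps only "the", "dog" and "sleeps", and "dog" ends up linked to both of the others. Edge direction and dependency labels are discarded on purpose, since the networks described here are undirected.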
Abstract
Syntax connects words to each other in very specific ways. Two words are syntactically connected if one depends directly on the other. Syntactic connections usually occur within a sentence. Gathering all those connections across many sentences gives rise to syntax networks. Earlier studies in the field have analysed the structure and properties of syntax networks in search of clusters or phylogenies of languages that share similar network features. This thesis puts those results to the test by increasing both the number of languages and the number of properties considered in the analysis. In addition, the networks of particular languages are inspected in depth by means of a novel network analysis [25]. Words (the nodes of the network) are clustered into topological communities whose members share similar features. The properties of each of these communities are studied thoroughly, along with the Part of Speech (grammatical class) of each word. Results across different languages are also compared in an attempt to discover universally preserved structural patterns in syntax networks.
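To make the idea of words "sharing similar features" concrete, the sketch below computes a small per-node feature vector and standardizes it, which is the usual preparation before a PCA and a clustering step. The three features chosen here (degree, local clustering coefficient, mean neighbour degree) and the function names are assumptions for illustration; the set of node properties actually used in the analysis of [25] may differ, and the PCA and community-detection steps themselves are not shown.

```python
from statistics import mean, pstdev


def node_features(adjacency):
    """Return {node: (degree, clustering coefficient, mean neighbour degree)}."""
    features = {}
    for node, nbrs in adjacency.items():
        k = len(nbrs)
        # Number of edges among the neighbours of this node.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adjacency[u])
        possible = k * (k - 1) / 2
        clustering = links / possible if possible else 0.0
        nbr_degree = mean(len(adjacency[u]) for u in nbrs) if nbrs else 0.0
        features[node] = (float(k), clustering, nbr_degree)
    return features


def standardize(features):
    """Z-score each feature column so that all features share a common scale."""
    nodes = list(features)
    columns = list(zip(*(features[n] for n in nodes)))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        n: tuple((x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(features[n], stats))
        for n in nodes
    }


# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
feats = node_features(toy_graph)
zscores = standardize(feats)
```

Each row of `zscores` is a point in a low-dimensional feature space. Nodes with similar topological roles, such as "a" and "b" in the toy graph, land on the same point, which is the intuition behind grouping words into topological communities.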
