Topology of Syntax Networks across Languages

Juan Soria-Postigo, Luis F Seoane

TL;DR

This work investigates whether syntactic dependencies across languages converge to a universal topological scaffold by constructing undirected syntax graphs from Universal Dependencies corpora for 50 languages and focusing on the 500 most frequent tokens. It introduces a morphospace-based node analysis to identify Topological Communities (TCs) via PCA, enabling a per-word functional taxonomy within each network. The results reveal a largely universal core-periphery backbone and connector structures across languages, while standard global-topology measures offer limited phylogenetic insight and outliers challenge evolutionary classifications. The Spanish inflected network case study demonstrates how TC analysis captures role-based organization and cross-language patterns, with implications for comparative linguistics and potential neurolinguistic interpretation of syntax networks.

Abstract

Syntax connects words to each other in very specific ways. Two words are syntactically connected if they depend directly on each other. Syntactic connections usually happen within a sentence. Gathering all those connections across several sentences gives rise to syntax networks. Earlier studies in the field have analysed the structure and properties of syntax networks, trying to find clusters/phylogenies of languages that share similar network features. The results obtained in those studies will be put to the test in this thesis by increasing both the number of languages and the number of properties considered in the analysis. In addition, the syntax networks of particular languages will be inspected in depth by means of a novel network analysis [25]. Words (nodes of the network) will be clustered into topological communities whose members share similar features. The properties of each of these communities will be thoroughly studied along with the Part of Speech (grammatical class) of each word. Results across different languages will also be compared in an attempt to discover universally preserved structural patterns across syntax networks.
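
The construction described above (direct dependencies within sentences, pooled across sentences into one undirected network) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes each sentence has already been reduced to a list of (head, dependent) word pairs, and the function names `build_syntax_network` and `degrees`, as well as the toy pairs, are hypothetical and not taken from any real annotation.

```python
from collections import defaultdict


def build_syntax_network(sentences):
    """Pool dependency pairs from many sentences into one undirected graph.

    `sentences` is an iterable of sentences; each sentence is a list of
    (head, dependent) word pairs. Returns a dict mapping each word to the
    set of words it is directly connected to.
    """
    adjacency = defaultdict(set)
    for dependencies in sentences:
        for head, dependent in dependencies:
            if head == dependent:
                continue  # ignore self-loops
            # Undirected network: store the link in both directions.
            adjacency[head].add(dependent)
            adjacency[dependent].add(head)
    return dict(adjacency)


def degrees(adjacency):
    """Number of distinct neighbours of each word (its degree)."""
    return {word: len(neighbours) for word, neighbours in adjacency.items()}


# Hypothetical toy input: hand-written pairs, not real corpus annotation.
toy_sentences = [
    [("content", "some"), ("content", "the"), ("applicable", "content")],
    [("create", "you"), ("create", "queries"), ("queries", "sql")],
    [("queries", "the"), ("describes", "queries")],
]

network = build_syntax_network(toy_sentences)
word_degrees = degrees(network)
```

Because links are stored as sets, a dependency that recurs in several sentences contributes a single edge; a weighted variant would need a counter instead. Words such as "the" that appear in several sentences end up bridging otherwise separate sentence trees, which is what turns a collection of trees into a network.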

Paper Structure

This paper contains 22 sections, 3 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Syntactic dependencies of the following text: "Some of the content in this topic may not be applicable to some languages. You can create SQL queries in one of two ANSI SQL query modes: ANSI-89 describes the traditional Jet SQL syntax". a Syntactic trees. b Syntactic network
  • Figure 2: Screenshot from corpus with an example of the annotated sentence "Some of the content in this topic may not be applicable to some languages"
  • Figure 3: Number of lines vs. number of languages. The figure shows the number of languages in our database whose corpus has more than a given number of lines
  • Figure 4: Node properties and eigenspace representation summary for inflected Spanish. a Heat map of covariances of primary node properties, as listed in Table \ref{tab:primaryproperties}. b Heat map of eigenvectors derived from all node properties $P$ as described in section 2.2.2. c Scatter plot of the first three principal components, coloured in normalized RGB colours according to the normalized first three principal components. d Network coloured in normalized RGB colours according to the normalized first three principal components.
  • Figure 5: Language comparison across principal components extracted from mean properties as described in section 2.2.1. Note that OL stands for Outliers. a First two principal components from mean properties of inflected languages coloured by hierarchical clustering communities. b First two principal components from mean properties of lemmatized languages coloured by hierarchical clustering communities. c Dendrogram showing relationships among inflected languages based on hierarchical clustering. d Dendrogram showing relationships among lemmatized languages based on hierarchical clustering
  • ...and 7 more figures