Complex systems approach to natural language

Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień

TL;DR

This article surveys how complex-systems concepts illuminate natural language, emphasizing word statistics, entropy and long-range correlations, and network representations. It shows that punctuation, time-series representations of sentence lengths and punctuation waiting times, and word-adjacency networks reveal universal and language-specific patterns, including Zipf-type laws, 1/f noise, and multifractality. The work integrates Zipf, Heaps, and Mandelbrot frameworks with discrete Weibull hazard modeling and multifractal DFA, illustrating how punctuation and processing choices shape statistical properties. Its synthesis supports applying complexity methods to NLP tasks, stylometry, and cognitive linguistics, with implications for NLP engineering and linguistic theory alike.

Abstract

The review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science and documents their applicability in identifying both universal and system-specific features of language in its written representation. Three main complexity-related research trends in quantitative linguistics are covered. The first part addresses word frequencies in texts and demonstrates that taking punctuation into consideration restores the scaling of Zipf's law, which is often observed to be violated for the most frequent words. The second part introduces methods inspired by time series analysis, used in studying various kinds of correlations in written texts. The related time series are generated by partitioning a text into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems, such as long-range correlations or (multi)fractal structures. Moreover, the distances between punctuation marks appear to comply with the discrete variant of the Weibull distribution. In the third part, the application of the network formalism to natural language is reviewed, particularly in the context of the so-called word-adjacency networks. Parameters characterizing the topology of such networks can be used for classification of texts, for example, from a stylometric perspective. The network approach can also be applied to represent the organization of word associations. The structure of word-association networks turns out to be significantly different from that observed in random networks, revealing genuine properties of language. Finally, punctuation seems to have a significant impact not only on the language's information-carrying ability but also on its key statistical properties; hence it is recommended to consider punctuation marks on a par with words.
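The two text representations mentioned in the abstract, a rank-frequency list in which punctuation marks count as tokens and a series of distances between consecutive punctuation marks, can be illustrated with a minimal sketch. The function names, the regular-expression tokenizer, the chosen set of punctuation marks, and the convention of measuring distances in words are all assumptions made for illustration; the review may define these quantities differently.

```python
import re
from collections import Counter

# Assumed set of punctuation marks treated as tokens (illustrative choice).
PUNCT = set(".,;:!?")


def tokenize(text, keep_punctuation=True):
    """Split text into lowercase word tokens, optionally keeping punctuation marks."""
    pattern = r"\w+|[.,;:!?]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text.lower())


def rank_frequency(text, keep_punctuation=True):
    """Return (token, count) pairs sorted by decreasing count, i.e. by rank."""
    return Counter(tokenize(text, keep_punctuation)).most_common()


def punctuation_gaps(text):
    """Return the number of words between consecutive punctuation marks.

    Adjacent marks (e.g. "?!") produce a gap of 0 in this simple convention.
    """
    gaps, run = [], 0
    for token in tokenize(text):
        if token in PUNCT:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps


sample = "The cat sat. The dog sat, and the bird sang."
ranked_with = rank_frequency(sample, keep_punctuation=True)
ranked_without = rank_frequency(sample, keep_punctuation=False)
gaps = punctuation_gaps(sample)
```

On a real corpus, plotting count against rank on log-log axes for both variants would show whether including punctuation changes the shape of the distribution at the lowest ranks, and the `gaps` series is the kind of input to which a distribution fit or a correlation analysis could be applied.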

Paper Structure (58 sections, 108 equations, 49 figures, 1 table)

Figures (49)

  • Figure 1: A network representing lexical relationships between 63 selected languages from the Indo-European family, based on a subset of the data used in Dyen1992. The data is a multilingual Swadesh list with $N=200$ entries. Each entry corresponds to one meaning and consists of the words representing that meaning in different languages. These words are divided into groups reflecting their possible common origin. As a result, each pair of words under a given entry is judged as "cognate", "doubtfully cognate", or "not cognate". One can define the proximity $n_c(l_1,l_2)$ between two languages $l_1,l_2$ as the total number of word pairs judged as "cognate" among all entries. Consequently, the distance between $l_1$ and $l_2$ can be expressed as $d(l_1,l_2)=N-n_c(l_1,l_2)$. The network is a directed tree representing the hierarchical clustering of the studied languages under this distance. Each leaf (a node with no incoming edges) corresponds to one language, and each internal node (a node with at least one incoming edge) is a cluster of languages. Consecutive groupings into bigger and bigger clusters are represented by arrows (directed edges). Each cluster is labeled with its internal minimum proximity: if $k$ is the number labeling the cluster, then the proximity $n_c$ (i.e., the number of shared cognates) between any two languages belonging to that cluster is not smaller than $k$. More advanced methods of analyzing the distances between language lexicons can be useful in reconstructing language evolutionary trees Gray2003.
  • Figure 2: (a)-(g) Zipf's plots for books written in various European languages. Each line represents a log-log plot of the rank-frequency distribution (continuous to make graphs more legible) for a single book. Words were not lemmatized. (h) Zipf's law for corpora constructed from a set of books in each language. In all panels the dashed line corresponds to a slope index equal to $-1$.
  • Figure 3: An illustration of Heaps' law created by using a corpus constructed from sample English books. The dots represent $V(N)$, the size of the vocabulary as a function of text length. The slope of the dashed line is equal to $\eta = 0.75$. A power-law regime holds for a few orders of magnitude. For small $N$, the relationship $V(N)$ is practically linear, as almost every consecutive word in the text expands the vocabulary. For large $N$, however, the lack of new, not yet encountered words makes $V(N)$ grow more slowly.
  • Figure 4: Survival function $\overline{F}(k)$ of the distribution of the number of balls in a box generated by a realization of the Yule process with $1.5 \cdot 10^5$ time steps. The parameters of the process are: $m=5$, $K_0=3$, $c=1.25$. Since both axes are logarithmic, a straight line indicates the presence of a power law. The dashed line has the slope $-\alpha=-1.85$, corresponding to the limiting distribution.
  • Figure 5: Log-log plots of exemplary functions $\omega(R)$ given by the Zipf-Mandelbrot law (Eq. (\ref{eq::Zipf_Mandelbrot_law})) with $\alpha = 1$ and different values of $c$.
  • ...and 44 more figures
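
The proximity $n_c(l_1,l_2)$ and distance $d(l_1,l_2)=N-n_c(l_1,l_2)$ defined in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the toy data (4 entries instead of $N=200$), the cognate-class labels, and the function names are hypothetical, and "doubtfully cognate" judgments are simply not modeled, so two words count as cognate only when they carry the same class label.

```python
from itertools import combinations

# Hypothetical toy Swadesh-style list: one dict per meaning, mapping each
# language to a cognate-class label. Equal labels mean "cognate".
entries = [
    {"English": "A", "German": "A", "Polish": "B"},
    {"English": "A", "German": "A", "Polish": "A"},
    {"English": "A", "German": "B", "Polish": "C"},
    {"English": "A", "German": "A", "Polish": "B"},
]


def proximity(entries, l1, l2):
    """n_c(l1, l2): number of entries in which the two words are cognate."""
    return sum(
        1 for entry in entries
        if l1 in entry and l2 in entry and entry[l1] == entry[l2]
    )


def distance(entries, l1, l2):
    """d(l1, l2) = N - n_c(l1, l2), with N the number of entries."""
    return len(entries) - proximity(entries, l1, l2)


languages = sorted(entries[0])
distances = {
    (a, b): distance(entries, a, b) for a, b in combinations(languages, 2)
}
```

A pairwise distance table of this kind is the input that a hierarchical clustering procedure would take to build a tree like the one described in the caption; the clustering step itself is not shown here.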