Competition between Two Kinds of Correlations in Literary Texts

S. S. Melnyk, O. V. Usatenko, V. A. Yampol'skii, V. A. Golick

TL;DR

The paper addresses how to quantify and model long-range correlations in coarse-grained literary texts using additive Markov chains with memory functions. It develops a framework linking memory functions to observed variance and correlation, and demonstrates that texts exhibit antipersistent short-range and power-law persistent long-range correlations, which together shape text statistics. Through analysis of the Bible and other works, it shows a robust, two-regime memory structure and reveals self-similarity under decimation, highlighting grammatical versus semantic contributions. The approach provides a compact, transferable descriptor (the memory function) for symbolic sequences and suggests broader applications to other complex correlated systems.

Abstract

A theory of additive Markov chains with long-range memory is used to describe the correlation properties of coarse-grained literary texts. The complex structure of the correlations in texts is revealed. Antipersistent correlations at small distances, L < 300, and persistent ones at L > 300 define this nontrivial structure. For several concrete examples of literary texts, the memory functions are obtained and their power-law behavior at long distances is established. This property is shown to be a cause of the self-similarity of texts with respect to the decimation procedure.
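
As an illustration of the model named in the abstract, the sketch below generates a binary additive Markov chain. It assumes that the probability of the next symbol being 1 is $\bar{a} + \sum_{r=1}^{N} F(r)\,(a_{i-r} - \bar{a})$, the usual form for additive chains; this summary does not quote the formula. The function name `generate_additive_chain`, the clipping of the probability to $[0, 1]$, and the memory values in the usage lines are illustrative assumptions, not taken from the paper.

```python
import random


def generate_additive_chain(memory, mean, length, seed=0):
    """Generate a binary additive Markov chain (illustrative sketch).

    memory[r - 1] plays the role of the memory function F(r), r = 1..N.
    mean is the average symbol value (a-bar).
    Assumed rule: P(a_i = 1 | past) = mean + sum_r F(r) * (a_{i-r} - mean).
    """
    rng = random.Random(seed)
    n_memory = len(memory)
    # Seed the chain with N uncorrelated symbols so the first step has a past.
    chain = [1 if rng.random() < mean else 0 for _ in range(n_memory)]
    for _ in range(length):
        p_one = mean + sum(
            memory[r] * (chain[-1 - r] - mean) for r in range(n_memory)
        )
        # Clipping is an assumption of this sketch; it keeps p_one a valid
        # probability when the chosen memory values are too large.
        p_one = min(1.0, max(0.0, p_one))
        chain.append(1 if rng.random() < p_one else 0)
    # Drop the uncorrelated seed symbols.
    return chain[n_memory:]


# Hypothetical memory function: negative (antipersistent) at short
# distances, small and positive (persistent) at longer ones.
example_memory = [-0.10, -0.05, 0.02, 0.02, 0.01]
sequence = generate_additive_chain(example_memory, mean=0.5, length=1000)
print(sum(sequence) / len(sequence))  # fraction of ones in the sample
```

Negative values of the memory function make a symbol less likely to repeat the recent past, and positive values make it more likely, which is the sense in which the two kinds of correlations compete.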

Paper Structure

This paper contains 9 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: The variance $D(L)$ for the coarse-grained text (letters $(a-m) \mapsto 0$, letters $(n-z) \mapsto 1$) of the Bible (solid line) and for the Markov chain generated by means of the reconstructed memory function $F(r)$ (filled circles). The coincidence of these curves demonstrates the robustness of our method of reconstructing the memory function (MF). The dotted straight line describes the non-correlated Brownian diffusion, $D_{0}(L)=L\bar{a}(1-\bar{a})$. The inset shows the antipersistent dependence of the dimensionless ratio $D(L)/D_0(L)$ on $L$ at short distances.
  • Figure 2: The variance $D(L)$ for the coarse-grained text ((letters with even numbers in the alphabet) $\mapsto 1$, (letters with odd numbers) $\mapsto 0$) of the Bible (dashed line) and for the sequence obtained by shuffling blocks of length $L_{0} = 3000$ (filled circles). The solid line represents the analytical results obtained with Eq. (\ref{corshuf}). The dotted straight line describes the non-correlated Brownian diffusion, $D_{0}(L)=L\bar{a}(1-\bar{a})$.
  • Figure 3: The local variance $D_{l}(10)$ for the coarse-grained text of the Bible vs the distance $l$. The averaging interval is $L_{0}=10^{5}$.
  • Figure 4: The memory function $F(r)$ for the coarse-grained text of the Bible at short distances. The power-law decreasing portion of the $F(r)$ plot for the Bible is shown by filled circles in the inset. The solid line corresponds to the power-law fit.
  • Figure 5: The memory function at long distances for the coarse-grained texts of eight literary works: 1. The Bible, 2. "Oliver Twist" by Charles Dickens, 3. "War and Peace" by Leo Tolstoy, 4. The Tora, 5. "Master and Margarita" by Mikhail Bulgakov, 6. "Don Quixote" by Miguel de Cervantes, 7. "Oblomov" by Ivan Goncharov, 8. The Koran.
  • ...and 2 more figures
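
The quantities in the captions of Figures 1 and 2 can be sketched in a few lines. The example below applies the Figure 1 coarse-graining (letters a-m to 0, n-z to 1) and compares the variance of the number of ones in windows of length $L$ with the uncorrelated value $D_{0}(L)=L\bar{a}(1-\bar{a})$. It assumes that $D(L)$ is the variance of the count of ones over all sliding windows of length $L$; the function names and the sample input are hypothetical and chosen only for illustration.

```python
import string


def coarse_grain(text):
    """Map letters a-m to 0 and n-z to 1; skip all other characters."""
    bits = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            bits.append(0 if ch <= "m" else 1)
    return bits


def window_variance(bits, window):
    """Variance of the number of ones over all sliding windows (assumed D(L))."""
    n_windows = len(bits) - window + 1
    if n_windows <= 0:
        raise ValueError("window is longer than the sequence")
    count = sum(bits[:window])
    total = count
    total_sq = count * count
    for i in range(1, n_windows):
        # Slide the window by one symbol: add the new one, drop the old one.
        count += bits[i + window - 1] - bits[i - 1]
        total += count
        total_sq += count * count
    mean = total / n_windows
    return total_sq / n_windows - mean * mean


def uncorrelated_variance(bits, window):
    """Reference value D_0(L) = L * a_bar * (1 - a_bar)."""
    a_bar = sum(bits) / len(bits)
    return window * a_bar * (1 - a_bar)


# Hypothetical input; a real analysis would use a full literary text.
sample = "In the beginning God created the heaven and the earth. " * 50
bits = coarse_grain(sample)
for window in (1, 5, 20):
    ratio = window_variance(bits, window) / uncorrelated_variance(bits, window)
    print(window, ratio)
```

A ratio $D(L)/D_0(L)$ below 1 indicates antipersistent behavior at that window length and a ratio above 1 indicates persistent behavior, which is how the inset of Figure 1 is read.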