Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko

Solomija Buk, Andrij Rovenchak

TL;DR

This work presents the first large-scale quantitative linguistic analysis of a Ukrainian literary novel, based on a frequency dictionary derived from the 1979 edition of the text (Franko, 1979). The text is preprocessed with orthography restoration and attention to euphony, and homonyms are explicitly disambiguated, yielding detailed statistics such as $N=93{,}885$ tokens, $V=9{,}962$ lemmas, and $V_1=4{,}902$ hapax legomena. The study then tests established linguistic laws on word length and frequency distribution, including Zipf–Mandelbrot and Altmann–Menzerath–type relations, and reports parameter values and rank-domain behavior. It shows two maxima in the distributions of word-forms by number of letters and phonemes, as well as trends in mean syllable length. A comparison of high-frequency words across languages highlights similarities in function words and named entities, offering a data-driven baseline for Ukrainian corpus linguistics and cross-linguistic analysis.

Abstract

This paper presents, for the first time, a comprehensive statistical characterization of a Ukrainian novel. The distribution of word-forms with respect to their length is studied. The Zipf-Mandelbrot and Altmann-Menzerath linguistic laws are analyzed.
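
As an illustration of the quantities named in the summary, the sketch below counts tokens ($N$), distinct lemmas ($V$), and hapax legomena ($V_1$) in an already lemmatized token list, and evaluates a generic Zipf-Mandelbrot curve. It is not the authors' procedure: the function names, the parameter names `a`, `b`, `c`, and the toy input are assumptions made for this example, and the textbook form $c/(r+b)^a$ may differ from the parameterization used in the paper.

```python
from collections import Counter


def vocabulary_stats(lemmas):
    """Return (N, V, V1) for a sequence of lemmatized tokens.

    N  = total number of tokens
    V  = number of distinct lemmas
    V1 = number of lemmas occurring exactly once (hapax legomena)
    """
    counts = Counter(lemmas)
    n_tokens = sum(counts.values())
    n_lemmas = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_tokens, n_lemmas, n_hapax


def zipf_mandelbrot(rank, a, b, c):
    """Generic Zipf-Mandelbrot form: frequency = c / (rank + b) ** a.

    The parameter names are hypothetical; the paper may use other symbols.
    """
    return c / (rank + b) ** a


# Toy input, invented for illustration (not taken from the novel).
toy = ["i", "ne", "i", "v", "ne", "i", "stezhka"]
N, V, V1 = vocabulary_stats(toy)
print(N, V, V1)

# Ratios derived from the counts quoted in the summary above.
print(round(93885 / 9962, 2))  # mean lemma frequency, N / V
print(round(4902 / 9962, 3))   # share of hapax legomena in the vocabulary, V1 / V
```

With the reported counts, roughly half of the lemmas in the novel occur only once, and a lemma occurs a little over nine times on average.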

Paper Structure

This paper contains 7 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The distribution of word-forms (fraction of unity, vertical axis) with respect to the number of constituting letters (a) and sounds (b).
  • Figure 2: The distributions related to the syllabic structure of words: the fraction of word-forms with respect to the number of constituent syllables (a) and the evidence of Menzerath's law (b). The right-most point was excluded from the fit due to poor statistical reliability.
  • Figure 3: The transition to different regimes in Zipf's law (a) and text coverage (b). In (a) the dashed-dotted line is the fit to the Zipf–Mandelbrot law.