Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko

Solomija Buk; Andrij Rovenchak

TL;DR

This work presents the first large-scale quantitative linguistic analysis of a Ukrainian literary novel, based on a frequency dictionary derived from the Franko (1979) edition. The text is preprocessed with orthography restoration and with attention to euphony, and homonyms are explicitly disambiguated, yielding detailed statistics such as $N=93{,}885$ tokens, $V=9{,}962$ lemmas, and $V_1=4{,}902$ hapax legomena. The study then tests established linguistic laws on word length and frequency distribution, including the Zipf–Mandelbrot and Altmann–Menzerath relations, and reports parameter values and the behavior in different rank domains. It finds two maxima in the distributions of word-forms by number of letters and phonemes, and traces the trend in mean syllable length. By comparing high-frequency words across languages, it highlights cross-language similarities in function words and named entities, offering a data-driven baseline for Ukrainian corpus linguistics and cross-linguistic analysis.
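The three headline quantities above ($N$, $V$, $V_1$) can be illustrated with a minimal sketch. It assumes the text has already been tokenized and lemmatized, with homonyms resolved; the function name `basic_stats` and the toy input are hypothetical and are not taken from the paper.

```python
from collections import Counter


def basic_stats(lemmas):
    """Return (N, V, V1) for a sequence of lemmatized tokens.

    N  = number of tokens (running words),
    V  = number of distinct lemmas (vocabulary size),
    V1 = number of hapax legomena (lemmas occurring exactly once).
    """
    counts = Counter(lemmas)
    n_tokens = sum(counts.values())
    n_lemmas = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_tokens, n_lemmas, n_hapax


# Toy input (hypothetical, already lemmatized); not data from the novel.
toy = ["i", "vin", "i", "vona", "ne", "i", "ne", "stezhka"]
N, V, V1 = basic_stats(toy)
print(N, V, V1)
```

With the figures quoted above, hapax legomena make up $V_1/V = 4{,}902/9{,}962 \approx 49\%$ of the novel's vocabulary.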

Abstract

In this paper, a comprehensive statistical characterization of a Ukrainian novel is given for the first time. The distribution of word-forms with respect to their size is studied. The linguistic laws of Zipf–Mandelbrot and Altmann–Menzerath are analyzed.

Paper Structure (7 sections, 4 equations, 3 figures, 1 table)

Introduction
Basic Principles of the Text Analysis
Euphony
Homonyms
Statistical Data
Distributions and Linguistic Laws
Comparison

Figures (3)

Figure 1: The distribution of word-forms (fraction of unity, vertical axis) with respect to the number of constituting letters (a) and sounds (b).
Figure 2: The distributions regarding the syllabic structure of the words: the fraction of word-forms with respect to the number of constituting syllables (a) and the evidence of Menzerath's law (b). The right-most point was excluded from the fit due to poor statistical reliability.
Figure 3: The transition to different regimes in Zipf's law (a) and text coverage (b). In (a) the dashed-dotted line is the fit to the Zipf–Mandelbrot law.
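The two quantities plotted in Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. The Zipf–Mandelbrot law is written here in its common form $f(r) = C/(r+b)^a$; the parameter names `c`, `a`, `b` and all function names are hypothetical, and the paper's own parametrization and fitted values are not reproduced here. Text coverage is taken to mean the cumulative share of tokens accounted for by the top-ranked items.

```python
from collections import Counter
from itertools import accumulate


def rank_frequency(items):
    """Frequencies sorted in descending order; index 0 corresponds to rank 1."""
    return sorted(Counter(items).values(), reverse=True)


def text_coverage(freqs):
    """Cumulative fraction of all tokens covered by the top-r ranks."""
    total = sum(freqs)
    return [running / total for running in accumulate(freqs)]


def zipf_mandelbrot(rank, c, a, b):
    """Zipf-Mandelbrot prediction f(r) = c / (r + b)**a (assumed form)."""
    return c / (rank + b) ** a


# Toy input (hypothetical); not data from the novel.
toy = ["i", "vin", "i", "vona", "ne", "i", "ne", "stezhka"]
freqs = rank_frequency(toy)
coverage = text_coverage(freqs)
print(freqs)
print(coverage)
```

On real data, the parameters of `zipf_mandelbrot` would be estimated by a least-squares fit to the rank-frequency list, possibly separately for each rank regime.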
