Statistical Parameters of the Novel "Perekhresni stezhky" ("The Cross-Paths") by Ivan Franko
Solomija Buk, Andrij Rovenchak
TL;DR
This work presents the first large-scale quantitative linguistic analysis of a Ukrainian novel, based on a frequency dictionary compiled from the Franko (1979) edition. Preprocessing restores the original orthography, accounts for euphony, and explicitly disambiguates homonyms. The resulting statistics include $N=93{,}885$ tokens, $V=9{,}962$ lemmas, and $V_1=4{,}902$ hapax legomena. The study then tests established linguistic laws on word length and frequency distribution, including the Zipf–Mandelbrot and Altmann–Menzerath relations, and reports the fitted parameter values and the behavior in different rank domains. It also reports two maxima in the letter and phoneme distributions, as well as trends in mean syllable length. A comparison of high-frequency words across languages highlights similarities in function words and named entities, providing a data-driven baseline for Ukrainian corpus linguistics and cross-linguistic analysis.
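As an illustration of how the three headline quantities relate, the following minimal sketch computes $N$, $V$, and $V_1$ from a list of lemmatized tokens. It assumes lemmatization and homonym disambiguation have already been done; the function name `vocabulary_stats` and the toy input are hypothetical and are not taken from the paper.

```python
from collections import Counter


def vocabulary_stats(lemmas):
    """Return (N, V, V1) for a sequence of lemmatized tokens.

    N  - total number of tokens (word occurrences)
    V  - number of distinct lemmas (vocabulary size)
    V1 - number of hapax legomena (lemmas occurring exactly once)
    """
    freq = Counter(lemmas)
    n_tokens = sum(freq.values())
    n_lemmas = len(freq)
    n_hapax = sum(1 for count in freq.values() if count == 1)
    return n_tokens, n_lemmas, n_hapax


# Toy input (hypothetical): each token is already replaced by its lemma.
toy = ["і", "він", "іти", "і", "вона", "іти", "дім"]
print(vocabulary_stats(toy))  # (7, 5, 3)
```

With the values reported above, hapax legomena account for roughly 49% of the lemma vocabulary ($4{,}902 / 9{,}962$), and each lemma occurs about 9.4 times on average ($93{,}885 / 9{,}962$).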
Abstract
In this paper, a comprehensive statistical characterization of a Ukrainian novel is given for the first time. The distribution of word-forms with respect to their length is studied. The Zipf-Mandelbrot and Altmann-Menzerath laws are analyzed.
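For orientation, the two laws named here are commonly written in the following textbook forms. The symbols are assumptions chosen for illustration, and the parametrization used in the body of the paper may differ. The Zipf-Mandelbrot law relates the frequency $f$ of a word to its rank $r$ in the frequency list:

$$f(r) = \frac{C}{(r + b)^{a}},$$

where $a$, $b$, and $C$ are fitting parameters, and $b = 0$ recovers the plain Zipf law. The Altmann-Menzerath law relates the mean size $y$ of a constituent (for example, a syllable) to the size $x$ of the construct containing it (for example, a word measured in syllables):

$$y = A\, x^{\beta}\, e^{-\gamma x},$$

with fitting parameters $A$, $\beta$, and $\gamma$. Sign conventions for the exponents vary between authors.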
