Statistics of punctuation in experimental literature -- the remarkable case of "Finnegans Wake" by James Joyce

Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień

TL;DR

This study analyzes punctuation-driven structure in experimental prose by modeling inter-punctuation distances with a discrete Weibull distribution, defined by $F(k)=1-(1-p)^{k^\beta}$ with hazard function $h(k)=1-(1-p)^{k^\beta-(k-1)^\beta}$, across a curated set of novels including works by James Joyce. It combines time-series analysis of breakpoints with multifractal detrended fluctuation analysis (MFDFA) to assess multiscaling in both inter-punctuation distances and sentence lengths, revealing that most texts conform to the Weibull regime while Joyce's Finnegans Wake and parts of Ulysses exhibit a decreasing hazard function and strong multifractal patterns. The results highlight long-range correlations and hierarchical organization in experimental literature, with Finnegans Wake showing especially symmetric and rich multifractality in sentence lengths and a trace of multifractality in punctuation as well. These findings advance our understanding of universal versus style-specific punctuation statistics and offer insights for linguistic theory and natural language processing applications that rely on textual structure and complexity.
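The two formulas above can be evaluated directly. The following is a minimal sketch, not the authors' code; the function names and the example value $p=0.2$ are illustrative assumptions. It shows how the shape parameter $\beta$ controls the hazard function: $\beta=1$ gives the constant hazard $h(k)=p$ of the geometric distribution, $\beta>1$ an increasing hazard, and $\beta<1$ a decreasing one.

```python
def dw_cdf(k, p, beta):
    """F(k) = 1 - (1 - p)**(k**beta), for integer k >= 0."""
    return 1.0 - (1.0 - p) ** (k ** beta)


def dw_pmf(k, p, beta):
    """P(X = k) = F(k) - F(k - 1), for integer k >= 1."""
    return dw_cdf(k, p, beta) - dw_cdf(k - 1, p, beta)


def dw_hazard(k, p, beta):
    """h(k) = P(X = k | X >= k) = 1 - (1 - p)**(k**beta - (k - 1)**beta)."""
    return 1.0 - (1.0 - p) ** (k ** beta - (k - 1) ** beta)


# Illustrative parameter values only (not fitted to any text).
for beta in (0.7, 1.0, 1.3):
    hazards = [round(dw_hazard(k, 0.2, beta), 3) for k in range(1, 6)]
    print(f"beta={beta}: {hazards}")
```

The hazard function here is the conditional probability that a punctuation mark follows the $k$-th word, given that none followed the preceding $k-1$ words.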

Abstract

As recent studies indicate, the structure imposed on written texts by punctuation develops patterns that reveal certain characteristics of universality. In particular, based on a large collection of classic literary works, it has been shown that the distances between consecutive punctuation marks, measured in number of words, obey the discrete Weibull distribution - a discrete variant of a distribution often used in survival analysis. The present work extends the analysis of punctuation usage patterns to more experimental pieces of world literature. It turns out that the compliance of the distances between punctuation marks with the discrete Weibull distribution typically applies here as well. However, some of the works by James Joyce are distinct in this regard - in the sense that the tails of the relevant distributions are significantly thicker and, consequently, the corresponding hazard functions are decreasing, a behavior not observed in typical literary prose. "Finnegans Wake" - the same one to which science owes the word "quarks" for the most fundamental constituents of matter - is particularly striking in this context. At the same time, in all the studied texts, the sentence lengths - representing the distances between sentence-ending punctuation marks - reveal more freedom and are not constrained by the discrete Weibull distribution. This freedom in some cases translates into long-range nonlinear correlations, which manifest themselves in multifractality. Again, a text particularly spectacular in terms of multifractality is "Finnegans Wake".
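As an illustration of the quantity being measured, the sketch below counts the words between consecutive punctuation marks in a short string. The tokenizing regular expression and the two sets of marks are assumptions made for this example; the abstract does not specify which marks the study counts or how the texts were preprocessed.

```python
import re

# Assumed sets of marks, chosen for illustration only.
ALL_MARKS = ".,;:!?"
SENTENCE_END = ".!?"


def distances(text, marks):
    """Number of words between consecutive occurrences of the given marks.

    Runs of adjacent marks (e.g. "?!") yield no zero-length distances,
    and words after the last mark are ignored.
    """
    marks = set(marks)
    # A token is either a word (allowing inner apostrophes or hyphens)
    # or a single non-space, non-word character.
    tokens = re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", text)
    result, count = [], 0
    for tok in tokens:
        if tok in marks:
            if count > 0:
                result.append(count)
            count = 0
        elif tok[0].isalnum() or tok[0] == "_":
            count += 1  # a word token
    return result


sample = "Well, yes. No! It is, I think, fine."
print(distances(sample, ALL_MARKS))     # [1, 1, 1, 2, 2, 1]
print(distances(sample, SENTENCE_END))  # [2, 1, 5]
```

With the sentence-ending set, the resulting sequence is the series of sentence lengths; with the full set, it is the series of inter-punctuation distances.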

Paper Structure (7 sections, 14 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Modeling the distribution of the distances between consecutive punctuation marks in Brave New World by Aldous Huxley with the discrete Weibull distribution: all punctuation marks (left column) and sentence-ending marks only (right column). The latter plots are equivalent to the distribution of sentence lengths. The rows show: (top) the empirical distributions shown as gray histograms together with the fitted discrete Weibull distributions denoted by blue symbols, (middle) the rescaled Weibull plots with blue lines corresponding to the fitted distributions, and (bottom) the hazard functions for the empirical data -- marked in black -- and for the fitted discrete Weibull distributions -- marked in blue.
  • Figure 2: Left column: (main) histograms of empirical inter-punctuation-mark distance distributions, along with the fitted discrete Weibull distributions, marked with blue symbols; (insets) the corresponding rescaled Weibull plots. Right column: the empirical hazard functions (black dots) and the hazard functions of the fitted discrete Weibull distributions if such fits are possible (blue curves). Each row corresponds to a particular novel.
  • Figure 3: (continued) The same characteristics for the remaining novels. For Ulysses, the individual hazard functions of the two halves of the text are also shown.
  • Figure 4: The same characteristics as in Fig. 2, for two books by James Joyce: A Portrait of the Artist as a Young Man and Dubliners -- histograms of inter-punctuation-mark distances along with the fitted discrete Weibull distributions (left column) and the corresponding hazard functions (right column).
  • Figure 5: MFDFA applied to time series of distances between consecutive punctuation marks for (a) Rayuela, (b) Finnegans Wake, and (c) Ulysses. For each book, the original time series $x(t)$ (top), the $q$th-order fluctuation functions $F_q(s)$ (bottom left), and the singularity spectrum $f(\alpha)$ (bottom right) are shown. The fluctuation functions for $q=0$ are distinguished by bold lines. Ulysses has been divided into two parts: the first contains chapters 1-10 (plotted in red), the second starts with chapter 11 (plotted in green). As these two parts differ qualitatively in terms of the studied characteristics, they have been analyzed separately; the point separating them (the end of chapter 10) is marked by a vertical dotted-dashed line in the relevant $x(t)$ plot. The same plot shows the end of chapter 3 and the end of chapter 15 (dotted lines), which delimit the three parts specified in the book itself (this partition is not considered in the analysis; parts 1 and 3 taken separately are too short for a statistical analysis of this kind). In the $F_q(s)$ plot, vertical dashed lines mark the range of scaling used in the computation of the $f(\alpha)$ spectrum.
  • ...and 2 more figures
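
The $q$th-order fluctuation functions $F_q(s)$ named in the caption of Figure 5 can be sketched as follows. This is a simplified illustration of the general MFDFA procedure, not the authors' implementation: the linear (first-order) detrending, the function names, and the white-noise test series are all assumptions made here, and the sketch stops before the computation of the singularity spectrum $f(\alpha)$.

```python
import math
import random


def profile(x):
    """Cumulative sum of the mean-subtracted series."""
    mean = sum(x) / len(x)
    out, total = [], 0.0
    for v in x:
        total += v - mean
        out.append(total)
    return out


def detrended_variance(segment):
    """Mean squared residual of a least-squares straight-line fit."""
    n = len(segment)
    t_mean = (n - 1) / 2.0
    y_mean = sum(segment) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    slope = sxy / sxx
    return sum((y - y_mean - slope * (t - t_mean)) ** 2
               for t, y in enumerate(segment)) / n


def fluctuation(x, s, q):
    """F_q(s) with linear detrending (assumed order); requires s >= 3."""
    y = profile(x)
    n_seg = len(y) // s
    # Non-overlapping segments counted from the start and from the end.
    starts = [i * s for i in range(n_seg)]
    starts += [len(y) - (i + 1) * s for i in range(n_seg)]
    variances = [detrended_variance(y[a:a + s]) for a in starts]
    if q == 0:
        # The q -> 0 limit is a logarithmic (geometric) average.
        return math.exp(sum(math.log(v) for v in variances)
                        / (2 * len(variances)))
    return (sum(v ** (q / 2.0) for v in variances)
            / len(variances)) ** (1.0 / q)


# Synthetic uncorrelated series, used only to exercise the functions.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
for q in (-2, 0, 2):
    print(q, [round(fluctuation(noise, s, q), 3) for s in (16, 32, 64)])
```

If $F_q(s)$ follows a power law $s^{h(q)}$ over some range of scales, a dependence of $h(q)$ on $q$ indicates multifractality, whereas a constant $h(q)$ corresponds to a monofractal series.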