Quantifying patterns of punctuation in modern Chinese prose

Michał Dolina, Jakub Dec, Stanisław Drożdż, Jarosław Kwapień, Jin Liu, Tomasz Stanisz

TL;DR

The paper investigates punctuation and word usage patterns in modern Chinese prose using Zipf's law, a discrete Weibull model for inter-punctuation distances, and Multifractal Detrended Fluctuation Analysis (MFDFA) to quantify sentence-length variability. By analyzing three contemporary Chinese novels and their English translations, it shows Zipf-like rank-frequency behavior with a scaling index near $\gamma \approx 1$ when counting $n$-grams, and demonstrates that including punctuation marks improves the Zipf fits. Inter-punctuation distances follow a discrete Weibull law, with large distances less frequent in the Chinese texts than in the English translations, while sentence lengths deviate more from this law. Multifractal analysis reveals strong multifractality in sentence lengths for Soul Mountain (and The Drunkard) and more monofractal behavior for The Sun Shines over the Sanggan River, with translations showing broadly similar fractal traits in some cases. Overall, the findings point to universal punctuation and word-distribution patterns across languages and highlight how narrative form influences fractal structure, while calling for broader corpora to validate cross-language generalizations.

Abstract

Recent research shows that punctuation patterns in texts exhibit universal features across languages. Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution, typically used in survival analysis. By extending this analysis to Chinese literature represented here by three notable contemporary works, it is shown that Zipf's law applies to Chinese texts similarly to Western texts, where punctuation patterns also improve adherence to the law. Additionally, the distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. Sentence-ending punctuation, representing sentence length, diverges more from this pattern, reflecting greater flexibility in sentence length. This variability supports the formation of complex, multifractal sentence structures, particularly evident in Gao Xingjian's "Soul Mountain". These findings demonstrate that both Chinese and Western texts share universal punctuation and word distribution patterns, underscoring their broad applicability across languages.
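
As a rough illustration of the quantities named in the abstract, the sketch below counts words between consecutive punctuation marks and evaluates a discrete Weibull probability mass function. It assumes the common parametrization $P(k) = (1-p)^{(k-1)^\beta} - (1-p)^{k^\beta}$ for $k = 1, 2, \ldots$; the exact form used in the paper is not stated in this excerpt. The punctuation set and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical punctuation set for illustration only; the set used in the
# paper is not specified in this excerpt.
PUNCTUATION = set(".,;:!?。，；：！？、")


def inter_punctuation_distances(tokens):
    """Return the number of words between consecutive punctuation marks.

    Words before the first mark count as one interval; words after the
    last mark are dropped; adjacent marks (zero words apart) are skipped.
    """
    distances = []
    count = 0
    for tok in tokens:
        if tok in PUNCTUATION:
            if count > 0:
                distances.append(count)
            count = 0
        else:
            count += 1
    return distances


def discrete_weibull_pmf(k, p, beta):
    """Assumed form: P(k) = (1-p)^((k-1)^beta) - (1-p)^(k^beta), k >= 1."""
    if k < 1:
        return 0.0
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


def discrete_weibull_cdf(k, p, beta):
    """Cumulative probability P(X <= k) under the same assumed form."""
    if k < 1:
        return 0.0
    return 1.0 - (1.0 - p) ** (k ** beta)


tokens = ["I", "went", "home", ",", "then", "slept", "."]
distances = inter_punctuation_distances(tokens)
```

With $\beta = 1$ this form reduces to the geometric distribution, so $p$ can be read as the probability of placing a punctuation mark directly after the first word, and $\beta$ as a measure of how that probability changes with the distance from the previous mark.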

Paper Structure

This paper contains 11 sections, 11 equations, 14 figures.

Figures (14)

  • Figure 1: Rank-frequency distributions of words (crosses) and words + punctuation marks (full circles) for the three Chinese books: The Drunkard (top left), The Sun Shines over the Sanggan River (top right), and The Soul Mountain (bottom). The distributions are fitted with a power-law function whose scaling index $\gamma$ is given explicitly in each panel.
  • Figure 2: The same analysis as that in Fig. 1 but here for the corresponding English translations.
  • Figure 3: Time series of the distances between consecutive punctuation marks measured in words (left) and characters (right) for three Chinese novels: The Drunkard (top), The Sun Shines over the Sanggan River (middle), and The Soul Mountain (bottom). Insets show the cumulative distribution functions for the time series shown in the main panels. The largest data point, at 22,028 in The Soul Mountain (bottom), which equals 311 words or 494 characters, has been cut off in order to zoom in on the range of the vertical axis.
  • Figure 4: The same quantities as in Fig. 3 but here for the consecutive sentence lengths measured in words (left) and characters (right). Note that the vertical axis in the bottom panels differs from the one in the top and middle panels.
  • Figure 5: Time series of sentence lengths (left) and time series of the distances between consecutive punctuation marks (right) for the English translations of the same Chinese novels as in Figs. 3 and 4. Please note the different scales of the vertical axis. Insets show the cumulative distribution functions for the time series shown in the main panels.
  • ...and 9 more figures
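
The rank-frequency fits described in the captions of Figs. 1 and 2 can be sketched as follows. This is a minimal example that estimates the scaling index $\gamma$ in $f(r) \propto r^{-\gamma}$ by ordinary least squares on log-log axes; this fitting method is an assumption, since the excerpt does not say how the authors fitted the power law. The function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_zipf_exponent(frequencies):
    """Estimate gamma in f(r) ~ r^(-gamma) from a ranked frequency list.

    Uses a least-squares line through (log rank, log frequency); this is
    a simple stand-in, not necessarily the procedure used in the paper.
    Requires at least two ranks.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic frequencies that follow f(r) = 1000 / r exactly.
synthetic = [1000.0 / r for r in range(1, 51)]
gamma = fit_zipf_exponent(synthetic)
```

To compare words alone with words plus punctuation marks, as in the figures, the same two functions would be applied to a token list with punctuation removed and to one in which each mark is kept as its own token.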