
How does Burrows' Delta work on medieval Chinese poetic texts?

Boris Orekhov

TL;DR

This paper evaluates Burrows' Delta for authorship attribution in medieval Chinese Tang poetry. The language has limited inflection and its script does not separate words with spaces, which challenges the typical word-based applications of Delta. The study adopts a character-level Delta approach, using the standard intertextual distance $\Delta = \sum_{i=1}^{n} \frac{|z(x_i) - z(y_i)|}{n}$ and the Stylo package to compare samples built from the Complete Tang Poems. By constructing multiple five-fold randomized corpora from the 20 most prolific Tang poets and evaluating 100-character tokens, the study demonstrates robust author clustering with no cross-author confusions. The results indicate that Delta remains effective across languages and writing systems, and that raw text may be used without complex tokenization. This suggests that Delta is practically applicable to the attribution of medieval Chinese texts and motivates further exploration of token counts and sample configurations.
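The distance above can be illustrated with a minimal Python sketch that treats every non-whitespace character as a token. This is not the study's pipeline, which relies on Stylo. The function names, the use of the population standard deviation for the z-scores, and the toy strings (invented, not taken from the Complete Tang Poems) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_chars(texts, n):
    """Return the n most frequent non-whitespace characters across all texts."""
    counts = Counter(c for t in texts for c in t if not c.isspace())
    return [c for c, _ in counts.most_common(n)]


def relative_frequencies(text, features):
    """Relative frequency of each feature character within one text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return [0.0] * len(features)
    counts = Counter(chars)
    return [counts[f] / len(chars) for f in features]


def burrows_delta(texts, n_features=100):
    """Pairwise Delta: mean absolute difference of per-character z-scores."""
    features = most_frequent_chars(texts, n_features)
    freqs = [relative_frequencies(t, features) for t in texts]

    # z-score each feature across the corpus.
    # Assumption: population standard deviation; other tools may use the sample one.
    z = [[0.0] * len(features) for _ in texts]
    for j in range(len(features)):
        column = [row[j] for row in freqs]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            z[i][j] = (value - mu) / sigma if sigma > 0 else 0.0

    n = len(features)
    return [
        [sum(abs(a - b) for a, b in zip(z[i], z[k])) / n for k in range(len(texts))]
        for i in range(len(texts))
    ]


# Toy, invented samples: the first two and the last two share a character profile.
samples = ["山山山水", "山山水山", "水水水山", "水山水水"]
distances = burrows_delta(samples, n_features=2)
for row in distances:
    print(row)
```

In this toy input, samples with the same character profile should come out closer to each other than to the other pair, which is the pattern that the clustering step then exploits. No word segmentation is needed because the feature set is built directly from raw characters.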

Abstract

Burrows' Delta was introduced in 2002 and has proven to be an effective tool for authorship attribution. Although the languages it has been tested on differ, most of them belong to the same grammatical type and share the same graphic principle for conveying speech in writing: a phonemic alphabet with words separated by spaces. The question I want to address in this article is how well this attribution method works with texts in a language with a different grammatical structure and a script based on different principles. There are fewer studies analyzing the effectiveness of the Delta method on Chinese texts than on texts in European languages. I believe that this low level of attention to Delta among sinologists stems from the structure of the scholarly field devoted to medieval Chinese poetry. Clustering based on intertextual distances worked flawlessly: the samples of each author were most similar to one another, and Delta never confused different poets. Although I used an unconventional approach and applied the Delta method to a language that seems poorly suited to it, the method proved effective. Tang dynasty poets are correctly identified using Delta, and the empirical pattern observed for authors writing in standard European languages has been confirmed once again.

Paper Structure (5 sections, 1 equation, 2 figures, 1 table)

This paper contains 5 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Cluster analysis of the first shuffled test corpus.
  • Figure 2: Heatmap of the delta text distances in the first shuffled test corpus.