Author-Specific Linguistic Patterns Unveiled: A Deep Learning Study on Word Class Distributions

Patrick Krauss, Achim Schilling

TL;DR

This study investigates author-specific word class distributions using part-of-speech (POS) tagging and bigram analysis to demonstrate the efficacy of unigram and bigram-based representations and reveal meaningful clustering of authors’ works.

Abstract

Deep learning methods have been increasingly applied to computational linguistics to uncover patterns in text data. This study investigates author-specific word class distributions using part-of-speech (POS) tagging and bigram analysis. By leveraging deep neural networks, we classify literary authors based on POS tag vectors and bigram frequency matrices derived from their works. We employ fully connected and convolutional neural network architectures to explore the efficacy of unigram and bigram-based representations. Our results demonstrate that while unigram features achieve moderate classification accuracy, bigram-based models significantly improve performance, suggesting that sequential word class patterns are more distinctive of authorial style. Multi-dimensional scaling (MDS) visualizations reveal meaningful clustering of authors' works, supporting the hypothesis that stylistic nuances can be captured through computational methods. These findings highlight the potential of deep learning and linguistic feature analysis for author profiling and literary studies.
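The two feature types named in the abstract, POS tag vectors and bigram frequency matrices, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the text has already been POS-tagged, the 11-tag inventory `TAGS` is hypothetical (the source only indicates that the bigram matrices are 11x11), and normalizing the bigram counts by the number of bigrams is an assumption.

```python
from collections import Counter

# Hypothetical inventory of 11 coarse word classes; the paper's actual tag set may differ.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
        "ADP", "CONJ", "NUM", "PART", "PUNCT"]
INDEX = {tag: i for i, tag in enumerate(TAGS)}


def unigram_vector(tags):
    """Relative frequency of each word class, normalized by the number of tokens."""
    if not tags:
        raise ValueError("need at least one tag")
    counts = Counter(tags)
    return [counts[tag] / len(tags) for tag in TAGS]


def bigram_matrix(tags):
    """Relative frequency of consecutive word class pairs (row: first, column: second).

    Assumes every tag is in TAGS; counts are divided by the number of bigrams.
    """
    pairs = list(zip(tags, tags[1:]))
    if not pairs:
        raise ValueError("need at least two tags")
    matrix = [[0.0] * len(TAGS) for _ in TAGS]
    for first, second in pairs:
        matrix[INDEX[first]][INDEX[second]] += 1.0
    return [[count / len(pairs) for count in row] for row in matrix]


def flatten(matrix):
    """Turn the 11x11 matrix into a 121-dimensional vector, row by row."""
    return [value for row in matrix for value in row]


# Toy tag sequence standing in for a tagged book.
example = ["DET", "NOUN", "VERB", "DET", "NOUN"]
unigrams = unigram_vector(example)
bigrams = bigram_matrix(example)
```

The unigram vector discards word order entirely, while the bigram matrix keeps information about which word classes follow one another, which is the property the abstract credits for the improved classification.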

Paper Structure (25 sections, 6 figures, 2 tables)

This paper contains 25 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Histograms of POS tags and bigram matrices. The plots show the frequency of occurrence of several word classes (POS tags; panels a, c, e, g, i, k), normalized by the number of words, and the normalized frequency of occurrence of word class combinations of consecutive words (bigrams; panels b, d, f, h, j, l). Results are shown for two books by each of three authors (Edgar Allan Poe, Jules Verne, Stefan Zweig).
  • Figure 2: MDS plot of the unigram (a) and bigram (b) frequency vectors of 193 books from 76 different authors. a: Multi-dimensional scaling of 193 POS-tag vectors representing the literary works of 76 authors. b: Analogous analysis to a for bigram matrices. The 11x11 bigram matrices were flattened before the MDS procedure.
  • Figure 3: MDS analysis for all authors who occur more than 5 times in the data set. Note that no new MDS was performed (same projection as in Fig. 2). a: POS-tag vectors, b: bigram matrices.
  • Figure 4: Center of mass and deviation from the center of mass for MDS plots. a: POS-tag vectors, b: bigram matrices. Each cluster of the 8 different authors is represented by its center of mass (center of the circle) and the average standard deviation in 2D (radius of the circle).
  • Figure 5: Deep learning with POS-tag vectors. A simple fully connected network was trained on author classification. a: Training data set projected using MDS, b: Output of the last layer (softmax layer), i.e. the embeddings, projected using MDS. c: Test data set projected with MDS, d: Embeddings of the test data set. Author classification using POS-tag vectors alone yields a test accuracy below 0.5 (training accuracy: 0.61, test accuracy: 0.44).
  • ...and 1 more figures
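
The cluster summary described for Figure 4 (a center of mass and a circle radius per author) can be sketched as follows. This is a hedged illustration under stated assumptions: the 2D coordinates are taken as given (for example from an MDS projection), the "average standard deviation in 2D" is read as the mean of the per-axis standard deviations, and the population standard deviation is used. The function name `cluster_summary` and the sample points are hypothetical.

```python
from statistics import mean, pstdev


def cluster_summary(points):
    """Summarize one author's 2D cluster as (center of mass, radius).

    Assumption: the radius is the mean of the population standard
    deviations along the two axes; the paper may define it differently.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center = (mean(xs), mean(ys))
    radius = mean([pstdev(xs), pstdev(ys)])
    return center, radius


# Made-up coordinates standing in for one author's books in the MDS plane.
books = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = cluster_summary(books)
```

Under this reading, a small radius relative to the distance between centers indicates that an author's works sit close together and apart from those of other authors.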