Document Author Classification Using Parsed Language Structure

Todd K Moon; Jacob H. Gunther

TL;DR

This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies.

Abstract

Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as the occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted by a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of "proof texts," The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
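To make the four feature families named in the abstract concrete, the following is a minimal sketch of how such counts might be gathered from a single parse tree. It is an illustration under stated assumptions, not the authors' implementation: the toy tree is written by hand rather than produced by a parser, the nested-tuple representation and all function names are hypothetical, and the abstract does not give the exact subtree definitions, so the reading used here (a subtree of depth d is a node's grammatical skeleton cut off after d levels, with words dropped) is only one plausible interpretation.

```python
from collections import Counter

# Hand-written toy parse tree (assumption: not real parser output).
# Each node is a tuple (label, child, child, ...); a part-of-speech
# node has exactly one child, which is the word itself.
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "chased"),
               ("NP", ("DT", "a"), ("NN", "cat"))))

def is_pos(node):
    """True if the node is a part-of-speech tag sitting directly over a word."""
    return len(node) == 2 and isinstance(node[1], str)

def truncate(node, depth):
    """Label skeleton of `node`, cut off after `depth` levels; words are dropped."""
    if depth == 1 or is_pos(node):
        return node[0]
    return (node[0],) + tuple(truncate(c, depth - 1) for c in node[1:])

def height(node):
    """Number of label levels at and below `node` (words excluded)."""
    if is_pos(node):
        return 1
    return 1 + max(height(c) for c in node[1:])

def walk(node, level=0):
    """Yield (node, level) for every labeled node, with the root at level 0."""
    yield node, level
    if not is_pos(node):
        for child in node[1:]:
            yield from walk(child, level + 1)

def pos_counts(tree):
    """Part-of-speech feature: how often each tag occurs."""
    return Counter(n[0] for n, _ in walk(tree) if is_pos(n))

def pos_by_level_counts(tree):
    """POS-by-level feature: how often each (tag, tree level) pair occurs."""
    return Counter((n[0], lvl) for n, lvl in walk(tree) if is_pos(n))

def rooted_subtree(tree, depth):
    """Rooted-subtree feature: the top `depth` levels starting at the root."""
    return truncate(tree, depth)

def all_subtree_counts(tree, depth):
    """All-subtrees feature: depth-limited skeletons rooted at any node
    that is tall enough to fill `depth` levels."""
    return Counter(truncate(n, depth) for n, _ in walk(tree) if height(n) >= depth)

if __name__ == "__main__":
    print(pos_counts(TREE))
    print(pos_by_level_counts(TREE))
    print(rooted_subtree(TREE, 2))
    print(all_subtree_counts(TREE, 2))
```

In a sketch like this, counts accumulated over all sentences of a document would form that document's feature vector. Because the number of distinct subtrees grows quickly with depth, such vectors tend to be long and sparse, which is consistent with the abstract's remark that projecting the features into a lower dimensional space was helpful.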

Paper Structure

This paper contains 22 sections, 15 equations, 15 figures, and 17 tables.

Introduction and Background
Statistical Parsing and Extracted Features
Parse Tree Features
All Subtrees
Rooted Subtrees
Part-of-Speech
POS by Level
Classifier
Dimension Reduction
The Federalist Papers
All Subtrees
Rooted Subtrees
POS
POS by Level
Sanditon
...and 7 more sections

Figures (15)

Figure 1: Example parse tree
Figure 2: Some subtrees of depth 3 extracted from the tree in (\ref{eq:treeparse1})
Figure 3: Rooted subtrees of one, two, and three levels of the tree in (\ref{eq:treeparse1})
Figure 4: Illustration of within-cluster and between-cluster scattering and projection.
Figure 5: Classification of Federalist papers based on "all subtree" feature vectors
...and 10 more figures
