The Detection and Understanding of Fictional Discourse
Andrew Piper, Haiqi Zhou
TL;DR
This work addresses the detection of fictional discourse across a wide range of corpora, introducing WordNet-based supersense features derived from BookNLP and evaluating them with random-forest classifiers and BERT fine-tuning. The study demonstrates that fiction and non-fiction can be reliably distinguished across domains at both the document and sentence levels, with high F1 scores on many datasets and clear semantic patterns, notably embodied, sensorimotor cues. It also investigates domain effects, noting that GPT-generated stories are more readily learned as fiction within certain genres (notably folk tales) and that training data size influences cross-cultural accuracy. The findings suggest that semantic generalization via supersenses can illuminate the distinctive qualities of fictional discourse and aid in indexing cultural heritage archives, while pointing to multilingual and structural feature extensions as directions for broader applicability.
Abstract
In this paper, we present a variety of classification experiments related to the task of fictional discourse detection. We use a diverse array of datasets, including contemporary professionally published fiction, historical fiction from the HathiTrust, fanfiction, stories from Reddit, folk tales, GPT-generated stories, and anglophone world literature. Additionally, we introduce a new feature set of word "supersenses" that facilitates the goal of semantic generalization. The detection of fictional discourse can help enrich our knowledge of large cultural heritage archives and help us understand the distinctive qualities of fictional storytelling more broadly.
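
As a rough illustration of what a supersense feature set could look like, the sketch below turns a sequence of supersense tags into a vector of relative frequencies. It is not the authors' pipeline: the function name `supersense_features`, the five-label subset of WordNet supersense categories, and the toy tag sequences are all assumptions made for the example, and the tags are assumed to have been produced beforehand by a tagger such as BookNLP.

```python
from collections import Counter

# Illustrative subset of WordNet supersense labels (assumption: a real
# feature set would cover the full inventory of noun and verb categories).
SUPERSENSES = [
    "noun.body",
    "noun.cognition",
    "noun.communication",
    "verb.motion",
    "verb.perception",
]


def supersense_features(tags, vocabulary=SUPERSENSES):
    """Return the relative frequency of each label in `vocabulary`.

    `tags` is a list of supersense labels for one document or sentence.
    Frequencies are normalized by the total number of tags, including
    any tags that fall outside `vocabulary`.
    """
    total = len(tags)
    if total == 0:
        # No tagged words: return an all-zero vector of the right length.
        return [0.0] * len(vocabulary)
    counts = Counter(tags)
    return [counts[label] / total for label in vocabulary]


# Hypothetical toy inputs, invented for illustration only.
fiction_tags = ["verb.perception", "noun.body", "verb.motion", "noun.body"]
nonfiction_tags = [
    "noun.cognition",
    "noun.communication",
    "noun.cognition",
    "verb.motion",
]

fiction_vec = supersense_features(fiction_tags)
nonfiction_vec = supersense_features(nonfiction_tags)
```

Because each vector has one slot per supersense rather than one per word, passages with different vocabulary can still map to similar features, which is the sense of "semantic generalization" here. Vectors of this kind could then be passed to any standard classifier.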
