StoryScope: Investigating idiosyncrasies in AI fiction

Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting

Abstract

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs, yielding 61,608 stories of ~5,000 words each, with 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and exhibit greater temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.
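To make the binary detection setup concrete, here is a minimal sketch, assuming scikit-learn and synthetic stand-in feature vectors; the real 304 features come from StoryScope, and the classifier choice and split below are illustrative assumptions, not the authors' implementation:

    # Hypothetical sketch: a linear classifier over 304-dim per-story
    # narrative feature vectors, scored with macro-F1 as in the abstract.
    # X and y below are synthetic placeholders for StoryScope output.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 304))        # one 304-feature vector per story
    y = rng.integers(0, 2, size=2000)       # 0 = human, 1 = AI

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    print("macro-F1:", f1_score(y_te, clf.predict(X_te), average="macro"))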

Paper Structure

This paper contains 64 sections, 8 figures, and 16 tables.

Figures (8)

  • Figure 1: Overview of the StoryScope pipeline. Stories are converted into structured templates, then compared across sources writing to the same prompt to induce discriminative narrative features, and finally featurized across the full corpus for downstream detection and authorship experiments. Story inspired by "Tiny and the Monster" (Sturgeon, 1947).
  • Figure 2: Projection of narrative feature vectors onto the first two linear discriminant components. Human writing occupies a distinct region; the five AI models cluster together. Claude is the most distinct of the five AI models, while Gemini and DeepSeek are the nearest neighbors (see the sketch after this list).
  • Figure 3: Confusion matrix for authorship attribution (narrative model), shown as percentages (%). Misclassifications concentrate among the AI models, particularly DeepSeek, Gemini, and Kimi.
  • Figure 4: Boxplots of story lengths across all stories in the finalized six-source dataset, shown separately for the human corpus and each model. Boxes show the interquartile range, center lines show medians, and whiskers show the non-outlier range.
  • Figure 5: Per-story narrative rarity percentiles by source on the held-out test set. Solid lines show means; dashed lines show medians. Human stories are shifted toward higher rarity, but all distributions overlap substantially.
  • ...and 3 more figures
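
The linear discriminant projection behind Figure 2 can be sketched as follows, assuming scikit-learn's LinearDiscriminantAnalysis; the source labels mirror the six sources in the paper, but the feature data here is fabricated for illustration only:

    # Hypothetical sketch of the Figure 2 projection: fit LDA on six-way
    # source labels and scatter the first two discriminant components.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    sources = ["Human", "Claude", "GPT", "Gemini", "DeepSeek", "Kimi"]
    X = rng.normal(size=(600, 304))              # stand-in narrative feature vectors
    y = rng.integers(0, len(sources), size=600)  # one source label per story

    Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
    for i, name in enumerate(sources):
        plt.scatter(Z[y == i, 0], Z[y == i, 1], s=8, label=name)
    plt.xlabel("LD1")
    plt.ylabel("LD2")
    plt.legend()
    plt.show()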