Table of Contents
Fetching ...

Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman

Abstract

I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi squared = 2121.6, p < 10 to the power negative 242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = negative 0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty

Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

Abstract

I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi squared = 2121.6, p < 10 to the power negative 242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = negative 0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
Paper Structure (35 sections, 2 equations, 6 figures, 6 tables)

This paper contains 35 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Eight narrative shape archetypes discovered by Ward-linkage hierarchical clustering on 16-segment PAA vectors from 28,606 books. Each panel shows the centroid curve (bold) with the 25th--75th percentile envelope (shaded). These eight shapes---from Steep Descent to Steep Ascent---constitute the core taxonomy of this paper.
  • Figure 2: The three canonical novelty trajectory types: convergent (green, 22.5%), plateau (blue, 59.0%), and divergent (red, 18.6%). These legacy categories are refined into eight archetypes by Ward-linkage clustering (Figure \ref{['fig:hero_archetypes']}).
  • Figure 3: Distribution of $\log_{10}$(downloads) across the eight narrative clusters. Download distributions vary systematically by cluster, reflecting the interplay between narrative shape, book length, and readership patterns explored in Section \ref{['sec:results']}.
  • Figure 4: Genre composition within each of the eight narrative clusters. Fiction dominates plateau clusters (Flat, Late Plateau); History and Other dominate descent and ascent clusters. Genre and cluster are strongly associated ($\chi^2 = 543.2$, $p = 2.41 \times 10^{-74}$).
  • Figure 5: Circuitousness vs. $\log_{10}$(downloads), colored by curve type. The raw Spearman $\rho = 0.41$ is substantially confounded by book length ($\rho_{\text{circ,length}} = 0.93$); after length control, the partial $\rho = 0.11$. See Table \ref{['tab:toubia']} for length-controlled rankings.
  • ...and 1 more figures