Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis

Fred Zimmerman

TL;DR

I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. The analysis reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras.

Abstract

I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). There are four principal findings. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness (the ratio of cumulative path length to net displacement in embedding space) is 67% higher in the modern corpus. Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is essentially uncorrelated with reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
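
As a minimal illustration of the two trajectory measures named in the abstract, the sketch below computes a running-centroid novelty series and a circuitousness ratio from a matrix of paragraph embeddings. The abstract does not give the exact formulas, so several choices here are assumptions rather than the paper's implementation: novelty is taken as the cosine distance between each paragraph and the mean of all preceding paragraphs, step lengths are Euclidean, and the function names are hypothetical.

```python
import numpy as np


def running_centroid_novelty(emb):
    """Novelty of paragraph t = cosine distance to the centroid of paragraphs 0..t-1.

    Assumed definition; returns n-1 values because the first paragraph
    has no preceding centroid. Zero-norm vectors would yield NaN.
    """
    emb = np.asarray(emb, dtype=float)
    # Running mean of paragraphs 0..t-1, for t = 1..n-1.
    counts = np.arange(1, len(emb))[:, None]
    centroids = np.cumsum(emb, axis=0)[:-1] / counts
    current = emb[1:]
    cosine = np.sum(current * centroids, axis=1) / (
        np.linalg.norm(current, axis=1) * np.linalg.norm(centroids, axis=1)
    )
    return 1.0 - cosine


def circuitousness(emb):
    """Cumulative path length divided by net displacement (first to last paragraph)."""
    emb = np.asarray(emb, dtype=float)
    path_length = np.linalg.norm(np.diff(emb, axis=0), axis=1).sum()
    net_displacement = np.linalg.norm(emb[-1] - emb[0])
    if net_displacement == 0.0:
        # A trajectory that returns exactly to its start has no finite ratio.
        return float("inf")
    return float(path_length / net_displacement)
```

Under these assumptions, a book whose paragraphs move in a straight line through embedding space has circuitousness 1.0, and the value grows as the trajectory wanders without making net progress.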

Paper Structure

This paper contains 38 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Distribution of mean book-level novelty for PG19 (blue) and Books3 (orange). The modern corpus is shifted rightward, indicating systematically higher novelty.
  • Figure 2: Trajectory cluster prevalence in PG19 vs. Books3. The Flat archetype dominates pre-1920 literature; Gradual Rise dominates the modern corpus.
  • Figure 3: Poetry trajectory comparison. Left: representative PG19 poem showing low circuitousness (tight semantic oscillation). Right: representative Books3 poem showing high circuitousness (wide semantic exploration).
  • Figure 4: Curve-type distribution by genre across corpora. Structural genre signatures persist across eras despite overall shifts in novelty levels.
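
The trajectory clusters compared in Figure 2 are built from PAA-16 representations of each book's novelty series. As a rough sketch of that reduction step, the function below applies piecewise aggregate approximation by splitting a series into 16 near-equal segments and averaging each one. The function name is hypothetical, and `np.array_split` is a simplification: when the series length is not a multiple of 16, textbook PAA weights boundary points fractionally, which this sketch does not do. The clustering algorithm applied to the resulting vectors is not stated in this summary, so none is shown.

```python
import numpy as np


def paa(series, n_segments=16):
    """Piecewise aggregate approximation: mean of each of n_segments near-equal chunks."""
    series = np.asarray(series, dtype=float)
    if len(series) < n_segments:
        raise ValueError("series must contain at least n_segments points")
    # array_split tolerates lengths that are not a multiple of n_segments;
    # the leading chunks then hold one extra point each.
    chunks = np.array_split(series, n_segments)
    return np.array([chunk.mean() for chunk in chunks])
```

Each book, whatever its paragraph count, is thereby mapped to a fixed-length 16-dimensional shape vector, which is what makes trajectories of different lengths comparable.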