Table of Contents
Fetching ...

From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Jason Dury

Abstract

Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map; from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross-examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.

From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Abstract

Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map; from broad modes like "direct confrontation" and "lyrical meditation" to precise registers and scene templates like "sailor dialect" and "courtroom cross-examination." At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.
Paper Structure (29 sections, 1 equation, 5 figures, 6 tables)

This paper contains 29 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Method overview. Texts are chunked into short passages, embedded with BGE-large-en-v1.5, and temporal co-occurrence pairs are extracted within a 15-chunk window. A contrastive MLP maps embeddings into association space; $k$-means clustering at six granularities ($k{=}50$ to $k{=}2{,}000$) produces a multi-resolution concept map.
  • Figure 2: BGE similarity-based clusters group passages by topic (what text is about), while PAM association-based clusters group by transition structure (what text does)---narrative function, discourse register, and literary tradition.
  • Figure 3: Multi-resolution concept structure from $k{=}50$ to $k{=}2{,}000$. Broad narrative modes at coarse resolution decompose into increasingly specific variants---registers, traditions, and scene templates---at finer granularity.
  • Figure 4: Authorial pacing signatures for three canonical works. Different authors produce characteristically different distributions of structural modes, visible at multiple resolutions simultaneously.
  • Figure 5: Cluster selectivity for unseen novels. PAM concentrates each novel into a selective subset of structurally coherent clusters, while BGE assignment saturates nearly all clusters.