Self-Organizing Language

P. Myles Eugenio, Anthony Beavers

TL;DR

The work presents a theory and framework for self-organizing language where global symbolic order emerges from locally coordinated learning. It introduces the retokenizer, a hierarchical, locally updated memory that builds higher-order tokens via retokenization projections P_n, yielding implicit n-point interactions and a topologically protected ground-state automaton. Through energy-based inference, left/right retokenization, and a replay mechanism, the model forms long-term symbol-like embeddings a_α and a hierarchical key-value memory for recognition, all without a global objective or backpropagation. The approach demonstrates how subword structure, finite word length, and Zipf-like memory spectra arise from locality and smoothness constraints, and discusses compression via word-specific projector decompositions, super-hierarchies, and interpretability via embedding trees. The work argues for a physics-inspired, neuro-symbolic account of language emergence, contrasting it with non-local LLMs and highlighting potential insights into inter-generational learning and cognitive grounding.
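As a rough symbolic analogy for the hierarchical growth described above, the following Python sketch grows a prefix tree in which an order-n token can only form by attaching one symbol to an already stored order-(n-1) token. It is not the paper's retokenizer: there are no projections P_n, embeddings, or energy-based inference here. The names `ROOT`, `present`, and `has_path`, and the rule of at most one growth per presentation (similar in spirit to LZ78 dictionary growth), are assumptions made purely for illustration. The two words are taken from the Figure 2 caption.

```python
from typing import Dict, Tuple

# Hypothetical id of the empty context (order-0 node); not from the paper.
ROOT = 0

# Memory maps (parent token id, next symbol) -> child token id.
Memory = Dict[Tuple[int, str], int]


def present(memory: Memory, text: str) -> int:
    """Present one string to the memory.

    Walks existing nodes as far as possible, then adds at most ONE new node
    (an assumed locality rule: a new order-n token needs its order-(n-1)
    parent to exist already). Returns the id of the deepest node reached.
    """
    node = ROOT
    for symbol in text:
        key = (node, symbol)
        if key not in memory:
            memory[key] = len(memory) + 1  # fresh id; ids 1..N, ROOT is 0
            return memory[key]
        node = memory[key]
    return node


def has_path(memory: Memory, text: str) -> bool:
    """Return True if every symbol of `text` follows an existing edge from ROOT."""
    node = ROOT
    for symbol in text:
        key = (node, symbol)
        if key not in memory:
            return False
        node = memory[key]
    return True


if __name__ == "__main__":
    memory: Memory = {}
    for _ in range(6):          # one growth per presentation, six symbols
        present(memory, "yjjfqi")
    for _ in range(3):          # shares the prefix "yjjf", so only 3 new nodes
        present(memory, "yjjfgsp")
    print(len(memory), has_path(memory, "yjjfqi"), has_path(memory, "yjjfgsp"))
```

In this toy, a word of length n needs n presentations before it is fully stored, and a second word that shares a prefix reuses the nodes already present instead of duplicating them.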

Abstract

We introduce a novel paradigm of emergent local memory. It is a continuous-learning completely-parallel content-addressable memory encoding global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories realizing emergent symbolic order. It is therefore a neuro-symbolic bridge. It further has the ability to produce human language without data, by exploiting its own self-organizing dynamics. It teaches us that words arise as a side-effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work answers essential questions about the existence and origin of all the human language data.

Paper Structure

This paper contains 59 sections, 88 equations, 14 figures, 2 algorithms.

Figures (14)

  • Figure 1: Local interactions ($\tilde{v}^{(n-1)}\otimes v$) drive new model growths ($\tilde{v}^{(n)}$) when input events align with the allowed hierarchical DAG structure. Stored strings are induced global order parameters ($v\otimes v\otimes\cdots\otimes v\rightarrow\tilde{v}^{(n)}$), each living as a terminal node of the DAG. Each growth is compressed/decompressed as a matrix product state (MPS).
  • Figure 2: Summarizing the relationship between retokenization, locality, and hierarchy. Three words from a random language [Eugenio2025_HebbianLanguage] (shown in blue): yjjfqi, kltilj, and yjjfgsp.
  • Figure 3: Comparing Hamiltonians of topologically protected vs. unprotected global order. Context overlapping with a feature is driven to the fixed point (energy extremum) by a restoring force. We use the convention $E=-H$ for energy.
  • Figure 4: Hierarchical short-term memory retrieves via flows toward terminal nodes. Multiple terminal nodes compete during retrieval if they share features. However, retrieval becomes scale invariant beyond the last overlap. Note: DAGs are disentangled during formation of long-term memory; see Fig. \ref{Fig:plasticity&replay}.
  • Figure 5: 500 words from Alice in Wonderland [GutenbergAlice]. The solid line is a log-normal fit. The patterns at the shortest length scales are biased by the mechanical constraints of pronunciation/listening, but these biases don't account for why word length is finite. Structural constraints of memory become relevant at length scales $n>n_{\text{peak}}$, leading to the $d_n$ collapse and limiting the length of the longest pattern.
  • ...and 9 more figures
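
The log-normal fit mentioned in the Figure 5 caption can be illustrated with a minimal Python sketch. It assumes, which the caption does not state explicitly, that the fitted quantity is the distribution of word lengths $n$, and it uses the closed-form maximum-likelihood estimates (mean and standard deviation of $\log n$), treating the integer lengths as continuous values. The function names and the one-sentence sample text are placeholders for illustration; this is not the authors' fitting procedure or their 500-word sample.

```python
import math
import re
from typing import List, Tuple


def word_lengths(text: str) -> List[int]:
    """Lengths of alphabetic words in `text` (punctuation and digits ignored)."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]


def fit_lognormal(lengths: List[float]) -> Tuple[float, float]:
    """Maximum-likelihood log-normal parameters (mu, sigma) for positive values."""
    logs = [math.log(n) for n in lengths]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, math.sqrt(var)


def lognormal_pdf(n: float, mu: float, sigma: float) -> float:
    """Log-normal density at n > 0 (requires sigma > 0)."""
    z = (math.log(n) - mu) / sigma
    return math.exp(-0.5 * z * z) / (n * sigma * math.sqrt(2.0 * math.pi))


def lognormal_mode(mu: float, sigma: float) -> float:
    """Location of the density peak, exp(mu - sigma^2)."""
    return math.exp(mu - sigma ** 2)


if __name__ == "__main__":
    # Short stand-in text; the figure itself uses 500 words.
    sample = ("Alice was beginning to get very tired of sitting "
              "by her sister on the bank")
    lengths = word_lengths(sample)
    mu, sigma = fit_lognormal(lengths)
    print("mu, sigma:", mu, sigma)
    print("peak length (mode):", lognormal_mode(mu, sigma))
    for n in range(1, 11):
        print(n, lognormal_pdf(n, mu, sigma))
```

Under this assumption, the mode $e^{\mu-\sigma^2}$ of the fitted density is one natural candidate for the peak length scale $n_{\text{peak}}$ referred to in the caption.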