Self-Organizing Language
P. Myles Eugenio, Anthony Beavers
TL;DR
The work presents a theory and framework for self-organizing language where global symbolic order emerges from locally coordinated learning. It introduces the retokenizer, a hierarchical, locally updated memory that builds higher-order tokens via retokenization projections P_n, yielding implicit n-point interactions and a topologically protected ground-state automaton. Through energy-based inference, left/right retokenization, and a replay mechanism, the model forms long-term symbol-like embeddings a_α and a hierarchical key-value memory for recognition, all without a global objective or backpropagation. The approach demonstrates how subword structure, finite word length, and Zipf-like memory spectra arise from locality and smoothness constraints, and discusses compression via word-specific projector decompositions, super-hierarchies, and interpretability via embedding trees. The work argues for a physics-inspired, neuro-symbolic account of language emergence, contrasting it with non-local LLMs and highlighting potential insights into inter-generational learning and cognitive grounding.
Abstract
We introduce a novel paradigm of emergent local memory: a continuous-learning, completely parallel, content-addressable memory that encodes global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories that realize emergent symbolic order, and it therefore serves as a neuro-symbolic bridge. It can further produce human language without data by exploiting its own self-organizing dynamics. It teaches us that words arise as a side effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work answers essential questions about the existence and origin of all human language data.
