Hebbian learning the local structure of language
P. Myles Eugenio
TL;DR
The paper presents a locality-driven, unsupervised framework for language learning based on Hebbian plasticity across a hierarchical tokenization stack, supplemented by replay to form semantic embeddings. It introduces the retokenization group that builds higher-order n-gram tokens through projected, smooth representations and uses an energy-based inference mechanism akin to an N-point Ising model to predict next tokens. Replay with auxiliary embedding neurons resolves forgetting and enables compression, yielding a scalable, parallelizable memory (key-value memory) that ties token features to embeddings. Random hierarchies reproduced via replay generate morphology-like distributions, suggesting that neural locality constraints can give rise to the observed structure of natural language without data, with testable predictions for neural signatures of smooth tokens and morphological organization.
Abstract
Learning in the brain is local and unsupervised (Hebbian). We derive the foundations of an effective human language model inspired by these microscopic constraints. It has two parts: (1) a hierarchy of neurons which learns to tokenize words from text (whichiswhatyoudowhenyoureadthis); and (2) additional neurons which bind the learned semanticless patterns of the tokenizer into a semanticful token (an embedding). The model permits continuous parallel learning without forgetting, and is a powerful tokenizer which performs a renormalization-group transformation. This allows it to exploit redundancy, such that it generates tokens which are always decomposable into a basis set (e.g., an alphabet), and can mix features learned from multiple languages. We find that the structure of this model allows it to learn a natural language morphology WITHOUT data. The language data generated by this model predicts the correct distribution of word-forming patterns observed in real languages, and further demonstrates why, microscopically, human speech is broken up into words. This model provides the basis for understanding the microscopic origins of language and human creativity.
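
To make the idea of local, unsupervised learning with energy-based prediction concrete, here is a minimal toy sketch. It is not the model derived in the paper: it assumes only pairwise (2-point) couplings between adjacent characters rather than the hierarchical N-point structure, and it has no replay or embedding neurons. The names `hebbian_train`, `energy`, and `predict_next`, and the learning rate `lr`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def hebbian_train(text, lr=1.0):
    """Accumulate pairwise weights with a local Hebbian rule.

    weights[(a, b)] grows each time character a is immediately followed
    by character b, i.e. when the two units are active at adjacent positions.
    Assumption: a 2-point simplification, not the paper's full hierarchy.
    """
    weights = defaultdict(float)
    for prev, nxt in zip(text, text[1:]):
        weights[(prev, nxt)] += lr  # local update: uses only the co-active pair
    return weights


def energy(weights, prev, nxt):
    """Pairwise energy: lower energy means a more strongly learned pair."""
    return -weights.get((prev, nxt), 0.0)


def predict_next(weights, prev, alphabet):
    """Return the candidate character that minimises the pairwise energy."""
    # Sorting makes tie-breaking deterministic.
    return min(sorted(alphabet), key=lambda c: energy(weights, prev, c))


# Unspaced text, as in the abstract's own example.
corpus = "whichiswhatyoudowhenyoureadthis"
W = hebbian_train(corpus)
alphabet = set(corpus)

print(predict_next(W, "w", alphabet))  # prints h ("w" is followed by "h" three times)
```

Every update in this sketch depends only on two adjacent symbols, which is the sense of "local" used here. Extending it toward the framework described above would require higher-order tokens built from these pairs, which this toy example does not attempt.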
