Hebbian learning the local structure of language

P. Myles Eugenio

TL;DR

The paper presents a locality-driven, unsupervised framework for language learning based on Hebbian plasticity across a hierarchical tokenization stack, supplemented by replay to form semantic embeddings. It introduces the retokenization group that builds higher-order n-gram tokens through projected, smooth representations and uses an energy-based inference mechanism akin to an N-point Ising model to predict next tokens. Replay with auxiliary embedding neurons resolves forgetting and enables compression, yielding a scalable, parallelizable memory (key-value memory) that ties token features to embeddings. Random hierarchies reproduced via replay generate morphology-like distributions, suggesting that neural locality constraints can give rise to the observed structure of natural language without data, with testable predictions for neural signatures of smooth tokens and morphological organization.

Abstract

Learning in the brain is local and unsupervised (Hebbian). We derive the foundations of an effective human language model inspired by these microscopic constraints. It has two parts: (1) a hierarchy of neurons which learns to tokenize words from text (whichiswhatyoudowhenyoureadthis); and (2) additional neurons which bind the learned semanticless patterns of the tokenizer into a semanticful token (an embedding). The model permits continuous parallel learning without forgetting, and is a powerful tokenizer which performs renormalization group. This allows it to exploit redundancy, such that it generates tokens which are always decomposable into a basis set (e.g., an alphabet), and can mix features learned from multiple languages. We find that the structure of this model allows it to learn a natural language morphology WITHOUT data. The language data generated by this model predicts the correct distribution of word-forming patterns observed in real languages, and further demonstrates why, microscopically, human speech is broken up into words. This model provides the basis for understanding the microscopic origins of language and human creativity.

Paper Structure

This paper contains 13 sections, 25 equations, 7 figures.

Figures (7)

  • Figure 1:
  • Figure 2: Infinite strings composed of two repeating words. Boundary information is hidden to prevent the reader from using it to tokenize. The reader can tokenize by looking for the non-word-forming patterns which contain the word boundaries. Only "savevile" and "soldread" do not tokenize into 2 words uniquely.
  • Figure 3: (a) Hierarchy of unique $n$-grams from text taken from Alice in Wonderland. Different curves correspond to increasing text sizes $N_d\in\{235,2336,22762,107777\}$, with $N_d=107777$ being the complete text. Solid lines show log-normal fits (see appendix): $F(n,1.22,0.55,320)$, $F(n,1.35,0.45,1700)$, $F(n,1.52,0.35,6500)$, and $F(n,1.6,0.36,15000)$, respectively. (b) Hierarchy of language with $d=2$. Colors show how constraints on earlier levels, ${\color{blue} g^{(2)}_{\bf aa}=0}$, ${\color{red} g^{(2)}_{\bf bb}=0}$, and ${\color{purple} g^{(2)}_{\bf aa}g^{(2)}_{\bf bb}=0}$, limit the allowed growths at later levels. The hierarchy is stable if both aa and bb are disallowed, but collapses if a further constraint (say $g_{\bf aba}^{(3)}=0$) is introduced.
  • Figure 4: The total number of $n$-grams, $\sum_n d_n$, as a function of text size.
  • Figure 5: Rank-ordered frequency distribution of $n$-grams.
  • ...and 2 more figures
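
The quantities named in the captions of Figures 3(a), 4, and 5 (the number of unique $n$-grams $d_n$ at each level $n$, the total $\sum_n d_n$, and a rank-ordered frequency list) can be illustrated with a minimal Python sketch. This is not the paper's code. It assumes that an $n$-gram is a contiguous window of $n$ characters and that the text is used as given, with no preprocessing; the function names `ngram_counts`, `unique_ngram_hierarchy`, and `rank_ordered_frequencies` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every contiguous character window of length n in text.

    Assumption: an n-gram is a sliding character window; the paper's
    exact preprocessing is not given in this summary.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def unique_ngram_hierarchy(text, max_n):
    """Return {n: d_n}, where d_n is the number of distinct n-grams."""
    return {n: len(ngram_counts(text, n)) for n in range(1, max_n + 1)}


def rank_ordered_frequencies(text, n):
    """Return n-gram occurrence counts sorted from most to least frequent."""
    return sorted(ngram_counts(text, n).values(), reverse=True)


# Toy string over a two-letter alphabet in which "aa" and "bb" never occur.
text = "abababab"
hierarchy = unique_ngram_hierarchy(text, 4)   # d_n for n = 1..4
total = sum(hierarchy.values())               # sum over n of d_n
ranked = rank_ordered_frequencies(text, 2)    # counts of "ab" and "ba"
```

For this toy string every level contains exactly two distinct $n$-grams, so the hierarchy neither grows nor dies out. Applying the same functions to a real corpus at several text sizes would give curves of the kind the captions describe, though the fitted values quoted there depend on the author's own data handling.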