The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz

TL;DR

This work introduces Dragon Hatchling (BDH), a Large Language Model architecture grounded in scale-free, brain-inspired local graph dynamics. By modeling inference as edge-reweighting on neuron graphs and implementing a GPU-friendly BDH-GPU variant with ReLU-lowrank blocks and linear attention, the approach preserves Transformer-like scaling while improving interpretability and brain plausibility. The authors establish theoretical links between BDH and brain models, show emergent modular, scale-free network structure, and demonstrate Transformer-like performance on language and translation tasks at scales of 10M to 1B parameters. They further demonstrate interpretability through monosemantic synapses, sparse activations, and a micro-foundational perspective on attention, arguing for axiomatic AI and potential brain-science insights. Overall, BDH offers a principled, scalable path to long-context reasoning with interpretable dynamics and practical GPU efficiency, bridging Transformer machinery and brain-inspired computation.

Abstract

The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of $n$ locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
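
As a rough illustration of the working-memory idea in the abstract, the sketch below uses a generic Hebbian outer-product update of a synaptic state matrix and checks that reading from that state matches unnormalized linear attention. This is a textbook fast-weight rule, not the BDH update equations from the paper; the function name `hebbian_state`, the variable `sigma`, and the toy dimensions are assumptions made for illustration only.

```python
import numpy as np

def hebbian_state(keys, values):
    """Accumulate sigma += outer(v_t, k_t) over time (generic Hebbian rule, assumed here)."""
    sigma = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        sigma += np.outer(v, k)  # a synapse (i, j) strengthens when v_i and k_j co-activate
    return sigma

rng = np.random.default_rng(0)
T, n = 5, 8  # toy sequence length and neuron count (hypothetical values)

# Non-negative activations, loosely mirroring the "sparse and positive" vectors in the abstract.
keys = np.maximum(rng.normal(size=(T, n)), 0.0)
values = np.maximum(rng.normal(size=(T, n)), 0.0)
query = np.maximum(rng.normal(size=n), 0.0)

sigma = hebbian_state(keys, values)

# Reading from the synaptic state ...
read_from_state = sigma @ query
# ... equals unnormalized linear attention: sum_t (k_t . q) v_t
read_from_attention = (keys @ query) @ values

assert np.allclose(read_from_state, read_from_attention)
```

The point of the comparison is that a state held entirely in pairwise synapse weights can play the role of an attention memory; the paper's actual dynamics, including normalization and layering, are given by its own equations.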

Paper Structure

This paper contains 106 sections, 29 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: General overview of architectures and their relationships: the inference dynamics of BDH and BDH-GPU act as a natural bridge between Transformer and models of the brain. The two main inference mechanisms of a reasoning architecture, attention and the feed-forward network, are defined at a macro-level through tensor operations for the Transformer, and at the micro-level of neuron interactions through local graph dynamics for Brain models. The new BDH-GPU architecture is naturally defined both at the level of vectors and of particle dynamics of neurons and synapses, acting as a bridge between these two approaches. See also Table \ref{tab:comparison} at the end of the paper for a more detailed comparison of architecture properties.
  • Figure 2: The 'physical system' representation of BDH as a physical graph toy-model.
  • Figure 3: State-space equations of the model architectures introduced in this paper. All architectures refer to a set of $n$ interacting particles (neurons), with activation vectors $x_{t,l} \in (R^+)^{n}$. The vector $y_{t,l} \in (R^+)^{n}$ is (typically) sparse in the sense of $\|y_{t,l}\|_0$. Variables ${\boldsymbol\rho}_{t,l} \in R^{n \times d}$ or ${\boldsymbol\sigma}_{t,l} \in R^{n \times n}$ represent the hidden state of the system. $\diamond$ The graph-based BDH dynamics equation \ref{eq:bdhgraph}, equivalent to the ruleset from Table \ref{tab:protocolx}, serves as a starting point for the development of architectures represented as local graph kernels in a distributed computing system. $\diamond$ The simplified BDH-Normfree equation \ref{eq:bdhnoln} is a special case of BDH. Up to the lack of LayerNorms, it approximates the inference dynamics of BDH-GPU, with the correspondence ${\boldsymbol\rho}_{t,l}=E{\boldsymbol\sigma}_{t,l}$. $\diamond$ The tensor-based BDH-GPU architecture is described by equations \ref{eq:bdh} (mathematically equivalent to Definition \ref{def:bdh}, Eq. \ref{eq:integral} and \ref{eq:kvstate}) and is the primary point of reference for all model training and all empirical results presented in this study. For a discussion of extensions to BDH-GPU such as heads, see Subsection \ref{sec:layersheads}. A complete code listing for BDH-GPU is provided in Appendix \ref{sec:bdh_code_listing}.
  • Figure 4: Scaling of BDH-GPU architecture in dimension $n$. The other parameters can be considered fixed during scaling. For example, with choice of $d=256$ for low-rank dimension, $k=2$ for neuron pairing with RoPE, and $h=1$ for a single-head architecture, the model scales linearly in dimension $n$ in chunks of $dhk = 256\cdot2\cdot1=512$ parameters.
  • Figure 5: Neuron-neuron communication using graphs $G \in \mathcal{G}^2(n,m)$: correspondence between graph $H$ with $m$ edges (left), and neuron-neuron interaction graph $G = H^2$ (right). The approach makes it possible to express linear signal propagation on a broad class of graphs $\mathcal{G}^2(n,m)$ using two steps of linear dynamics on a sparse circuit $H$, i.e., $Gz = H^2z$ for $z \in (R^+)^n$.
  • ...and 12 more figures
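
The identity in the Figure 5 caption and the chunk arithmetic in the Figure 4 caption can be checked numerically with a small sketch. The random matrix `H`, its size and density, and the helper `params_per_chunk` are hypothetical stand-ins chosen for illustration; they are not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # toy number of neurons (assumption)

# Hypothetical sparse circuit H: non-negative weights, roughly 30% of entries nonzero.
H = rng.random((n, n)) * (rng.random((n, n)) < 0.3)
z = rng.random(n)  # non-negative signal, z in (R^+)^n

# Neuron-neuron interaction graph G = H^2, as in the Figure 5 caption.
G = H @ H

# Two steps of linear dynamics on H give the same result as one step on G.
two_steps = H @ (H @ z)
one_step = G @ z
assert np.allclose(two_steps, one_step)

def params_per_chunk(d, h, k):
    """Chunk size d*h*k from the Figure 4 caption."""
    return d * h * k

# The caption's example: d=256, k=2, h=1 gives chunks of 512 parameters.
chunk = params_per_chunk(d=256, h=1, k=2)
assert chunk == 512
```

Applying `H` twice never materializes the denser product `G`, which is the practical appeal of expressing propagation on $G$ through a sparse circuit.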

Theorems & Definitions (22)

  • Claim 1: informal overview of theoretical results for BDH
  • Claim 2: informal overview of theoretical results for BDH-GPU
  • Definition 1: Interaction kernel, general form
  • Definition 2: edge-reweighting kernel
  • Definition 3
  • proof
  • Definition 4: inference dynamics of BDH-GPU
  • Claim 3
  • proof
  • Claim 4
  • ...and 12 more