Scalable Machines with Intrinsic Higher Mental-State Dynamics

Ahsan Adeel, M. Bilal

Abstract

Drawing on recent breakthroughs in cellular neurobiology and detailed biophysical modeling linking neocortical pyramidal neurons to distinct mental-state regimes, this work introduces a mathematically grounded formulation showing how models (e.g., Transformers) can implement computational principles underlying awake imaginative thought to pre-select relevant information before attention is applied via triadic modulation loops among queries ($Q$), keys ($K$), and values ($V$). Scalability experiments on ImageNet-1K, benchmarked against a standard Vision Transformer (ViT), demonstrate significantly faster learning with reduced computational demand (fewer heads, layers, and tokens), consistent with our prior findings in reinforcement learning and language modeling. The approach operates at approximately $\mathcal{O}(N)$ complexity with respect to the number of input tokens $N$.
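To make the complexity claim concrete, the following is a minimal sketch of why pre-selecting a fixed number of tokens before attention can keep the cost roughly linear in $N$. It is an illustration under stated assumptions, not the Co$^4$ mechanism itself: the relevance score here is a plain dot product with a hypothetical `context` vector, standing in for the triadic $Q$–$K$–$V$ modulation, whose actual form is not given in this abstract. The function names and the parameter `k` are likewise assumptions introduced for illustration.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preselect_top_k(tokens, context, k):
    """Return indices of the k tokens most aligned with `context`.

    ASSUMPTION: a dot-product score replaces the paper's modulation step.
    Scoring costs O(N * d); heapq.nlargest costs O(N log k).
    """
    scores = [dot(t, context) for t in tokens]
    top = heapq.nlargest(k, range(len(tokens)), key=scores.__getitem__)
    return sorted(top)  # restore original token order


def attention(queries, keys, values):
    """Standard scaled dot-product attention (single head, no projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, key) / math.sqrt(d) for key in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def preselected_attention(tokens, context, k):
    """Attend only among the k pre-selected tokens: O(k^2 * d) pairwise work."""
    idx = preselect_top_k(tokens, context, k)
    kept = [tokens[i] for i in idx]
    return idx, attention(kept, kept, kept)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
    idx, out = preselected_attention(tokens, context=[1.0, 0.0], k=2)
    print(idx, out)
```

With `k` held fixed, the only work that grows with the input is the scoring pass over all $N$ tokens, while the quadratic attention term depends on `k` alone. Full self-attention over all tokens would instead need $N^2$ pairwise scores.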

Paper Structure (19 sections, 40 equations, 8 figures, 8 tables)

This paper contains 19 sections, 40 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Integrated view of biologically inspired mechanisms: (a) Pyramidal two-point neuron; (b) Co$^4$ triadic reasoning via $Q$–$K$–$V$ interactions; (c–d) MOD function dynamics illustrating context-sensitive filtering; $Y$ represents the FF signal, which is separated into relevant and irrelevant streams depending on the strength of $C$; (e–f) MOD contours and vector fields across the $R$–$C$ space. Surface plots are presented over a wider range of $R$ and $C$ to illustrate the global geometry of the cooperation landscape, while contour and vector-field visualizations focus on a smaller range to highlight local gradient flow and regime transitions near the origin. Variations in the strengths of $R$ and $C$ shift the system across distinct processing regimes analogous to the neurobiological AA, AD, and AD+Awake regimes, producing corresponding geometric deformations in gradient flow. By shaping representations prior to downstream readout, these modulation laws guide optimization along $R$–$C$ interaction manifolds, reducing propagation through noisy or irrelevant directions.
  • Figure 2: Early training comparison between an attention-only Vision Transformer (ViT) [dosovitskiy2020image], trained from scratch, and a Co$^4$ machine endowed with intrinsic mental-state-dependent processing regimes analogous to awake imaginative processing [Phillips2024cellular, graham2025context], which pre-select relevant information before attention is applied. The task is to identify a bird from the Mini-ImageNet dataset. Brightness indicates the regions each model emphasizes; for the ViT, this emphasis is measured after attention. In contrast, Co$^4$ rapidly forms a coherent interpretation of the input, highlighting the top-$k$ salient regions via internally generated awake imaginative regimes before attention is computed. Co$^4$ exhibits earlier and sharper activation over the semantically relevant object (bird), indicating more coherent internal inference. These findings raise questions about the necessity of deep attention stacks.
  • Figure 3: Complete attention distribution over $N$ input tokens for a single-layer Co$^4$ machine versus an attention-only ViT [dosovitskiy2020image], both trained on Mini-ImageNet for 30 epochs. The ViT exhibits more dispersed attention with less selective localization. In contrast, Co$^4$ demonstrates more centered, context-sensitive activation patterns, indicating stronger spatial coherence.
  • Figure 4: Co$^4$ versus Transformer on CIFAR-10, Tiny-ImageNet, and Mini-ImageNet, trained from scratch: (i–iii) performance of a single-layer model; (iv) inference runtime as a function of sequence length for a single layer (see A.6 for theoretical computational cost analysis); and (v–vi) layer-wise validation accuracy on Tiny-ImageNet and Mini-ImageNet.
  • Figure 5: Training results across CartPole (i–iii), PyBullet Ant (iv–v), CarRacing (visual input: $96 \times 96 \times 4$) (vi), and Acrobot (vii), with heat maps (viii–ix) provided as empirical evidence. $T_{M1}$–$T_{M4}$ denote alternative well-established TPN-inspired asynchronous MOD functions [kay2020contextual] (see A.3).
  • ...and 3 more figures