Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

Noor Ul Zain, Mohsin Raza, Ahsan Adeel

TL;DR

The paper introduces Co$^4$, a tiny single-layer language model with 8M parameters that employs triadic Q-K-V TPNs and two input integration points to achieve approximately linear-time training ($O(N)$), versus the quadratic scaling ($O(N^2)$) of the Transformer baselines. Trained on a 10M-token BabyLM slice, Co$^4$ achieves competitive zero-shot and SuperGLUE fine-tuning performance, outperforming GPT-2 and GPT-BERT on multiple metrics after only two epochs. The approach demonstrates strong sample efficiency and generalization, challenging prevailing scaling laws and suggesting that biologically inspired, shallow architectures can rival larger, deeper models. These results point to a potential shift toward more efficient, cognitively grounded language learning paradigms, with practical impact for resource-constrained settings.

Abstract

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.

Paper Structure

This paper contains 7 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Language models: GPT-2 (left) vs. Co$^4$ (right). In Co$^4$, the learnable parameters are only in the embedding layer and the initial Q, K, V representations, followed by a single layer of non-parametric triadic modulation loops (referred to as “1x” Co$^4$ or single-layered Co$^4$). Co$^4$ does not require the feed-forward neural network (FFNN/MLP) layer used in standard GPT-type architectures. Inside these loops, three populations of three pyramidal two-point processors, associated with Q, K, and V, respectively, simultaneously integrate feedforward (FF) information and feedback (FB) context at two functionally distinct sites. The apical (top-down) site (shown in the rectangle) integrates context, while FF information is integrated at the basal (bottom-up) site (shown in the triangle). Each processor uses asynchronous modulation (MOD) transfer functions and operates in higher-level perceptual processing (HLPP) or awake thought (AT) mode, depending on the strength of FB. It amplifies FF transmission when the FF information is relevant in that context (represented by P, D, U) and attenuates it otherwise, resulting in the selective amplification of coherent FF information (adeel2025beyond). P, D, and U, along with the credit assignment (reward) coming from the higher perceptual layer (teacher), can be seen as dynamic local competitive normalization and global cooperative organisation, respectively. This ensures that local and global coherence and consistency are maximized (marvan2024cellular), while prediction error or free energy (friston2005theory; friston2010free) is minimized, enabling a deeper form of "real understanding". A combination of three TPNs and one loop constitutes one agent. A set of 12 agents with 12 loops runs in parallel, evolving their Qs, Ks, and Vs simultaneously, before applying latent self-attention at $O(L \times N)$, where $L$ is a small fraction of the input sequence length $N$, making the overall cost approximately $O(N)$.
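
The $O(L \times N)$ cost quoted in the caption can be illustrated with a minimal NumPy sketch of attention from $L$ latent queries to $N$ token keys and values. This is not the authors' implementation: the triadic modulation loops are not reproduced, the function name `latent_attention` and the sizes `N`, `L`, `d` are hypothetical, and the random arrays merely stand in for evolved Q, K, V representations. The sketch only shows that the score matrix has $L \times N$ entries rather than $N \times N$.

```python
import numpy as np


def latent_attention(q_latent, k, v):
    """Scaled dot-product attention from L latent queries to N tokens.

    q_latent: (L, d) latent queries (hypothetical stand-in for evolved Qs)
    k, v:     (N, d) token keys and values
    Returns the (L, d) output and the (L, N) attention weights.
    """
    d = q_latent.shape[-1]
    scores = q_latent @ k.T / np.sqrt(d)            # (L, N) score matrix
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


# Assumed, illustrative sizes: L is chosen as a small fraction of N.
N, L, d = 512, 16, 32
rng = np.random.default_rng(0)
q = rng.standard_normal((L, d))
k = rng.standard_normal((N, d))
v = rng.standard_normal((N, d))

out, weights = latent_attention(q, k, v)

latent_entries = weights.size   # L * N score entries
full_entries = N * N            # entries in standard self-attention
print(weights.shape, latent_entries, full_entries)
```

With $L$ held fixed (or growing much more slowly than $N$), the number of score entries grows linearly in $N$, which is the sense in which the caption describes the overall cost as approximately $O(N)$.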