DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)

Giansalvo Cirrincione

Abstract

Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
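
As a rough illustration of the DDCL idea described above, the sketch below shows a prototype-dictionary block that could stand in for a transformer's feedforward sub-layer: each token is softly assigned to a small set of learned prototypes and reconstructed from them. The distance-based softmax assignment, the learnable temperature, and the class name are assumptions of this sketch, not the paper's exact DDCL formulation.

```python
import torch
import torch.nn as nn

class PrototypeDictionaryBlock(nn.Module):
    """Illustrative stand-in for the feedforward sub-layer: each token is softly
    assigned to K learned prototypes and reconstructed as a convex combination
    of them. The distance-based softmax and the learnable temperature are
    assumptions of this sketch, not the paper's exact DDCL update."""

    def __init__(self, d_model: int, num_prototypes: int, temperature: float = 1.0):
        super().__init__()
        self.prototypes = nn.Parameter(0.02 * torch.randn(num_prototypes, d_model))
        self.log_temperature = nn.Parameter(torch.tensor(float(temperature)).log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        T = self.log_temperature.exp()
        # Squared distance of every token to every prototype: (batch, tokens, K).
        dists = ((x.unsqueeze(-2) - self.prototypes) ** 2).sum(dim=-1)
        q = torch.softmax(-dists / T, dim=-1)   # soft assignment weights q_{nk}
        return q @ self.prototypes              # reconstruction from the dictionary

# Usage: drop-in replacement for a feedforward block inside a transformer layer.
block = PrototypeDictionaryBlock(d_model=64, num_prototypes=16)
out = block(torch.randn(2, 10, 64))            # -> shape (2, 10, 64)
```

Under this reading, the soft weights q would play the role of the assignment vectors $\mathbf{q}_n$ appearing in Lemma 1 below.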

Paper Structure

This paper contains 76 sections, 56 theorems, 69 equations, 3 figures, and 4 tables.

Key Result

Lemma 1

$\Sigma_q \succeq 0$ with $\mathrm{tr}(\Sigma_q) = \sum_n H_2(\mathbf{q}_n)$, where $H_2(\mathbf{q}_n) = \sum_k q_{nk}(1-q_{nk})$ is the Gini diversity of token $n$'s assignments. Furthermore, $\mathrm{tr}(\Sigma_q) = 0$ if and only if every token is hard-assigned to a single prototype.
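
A quick numerical check of this trace identity is possible under one assumption about the definition: here $\Sigma_q$ is taken to be $\sum_n \big(\mathrm{diag}(\mathbf{q}_n) - \mathbf{q}_n\mathbf{q}_n^\top\big)$, the per-token assignment covariance summed over tokens; this definition is inferred from the statement, not quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 4                                   # tokens, prototypes
Q = rng.dirichlet(np.ones(K), size=N)         # each row q_n lies on the simplex

# Assumed definition: Sigma_q = sum_n (diag(q_n) - q_n q_n^T).
Sigma_q = sum(np.diag(q) - np.outer(q, q) for q in Q)

gini = (Q * (1.0 - Q)).sum()                  # sum_n H_2(q_n) = sum_{n,k} q_nk (1 - q_nk)
assert np.isclose(np.trace(Sigma_q), gini)              # tr(Sigma_q) equals the Gini sum
assert np.all(np.linalg.eigvalsh(Sigma_q) >= -1e-12)    # Sigma_q is positive semidefinite

# Hard (one-hot) assignments make every H_2(q_n) zero, hence tr(Sigma_q) = 0.
Q_hard = np.eye(K)[rng.integers(0, K, size=N)]
Sigma_hard = sum(np.diag(q) - np.outer(q, q) for q in Q_hard)
assert np.isclose(np.trace(Sigma_hard), 0.0)
```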

Figures (3)

  • Figure 1: Synthetic experiments 1--3. Left (Exp. 1): Spectral ordering and directional coverage: $\lambda^{(h)}$ decreasing in $h$ (bars match the theoretical eigenvalues, shown as circles); the coverage fraction exceeds the theoretical lower bound $1-\varepsilon_{\theta_w}=0.899$ (dashed). Centre (Exp. 2): Temperature divergence: all eight heads converge to $T_{\min}=0.1$ in order of decreasing $\lambda^{(h)}$; Spearman rank correlation $=-1.00$ (prediction: $\leq -0.9$). Right (Exp. 3): Separation-force monotonicity: fractional $\mathcal{F}^{(h)}$ strictly decreasing (bars); empirical ratios lie on the diagonal, confirming the bound of Lemma \ref{lem:sep_monotone}.
  • Figure 2: Experiment 4 (Colab, BERT/SST-2, frozen encoder, 3 epochs). (a) Directional energy $\Gamma_h$ per head: values range from 0.15 to 0.32, confirming head differentiation on real data. (b) $\max_{h'}\,|\mathcal{S}(P^{(h')})_{\text{after}} - \mathcal{S}(P^{(h')})_{\text{before}}| = 0$ at all six pruning events (log scale; bars touch $10^{-15}$), verifying Theorem \ref{thm:P3}(i) to machine precision. (c) The actual drop in $F_{\rm sep}$ (blue) equals the $F_{\rm sep}$ of the pruned head (red) exactly at every event (all labelled OK), verifying Lemma \ref{lem:proto_orth}: prototype-subspace orthogonality makes the drop decompose exactly.
  • Figure 3: Experiment 5 (end-to-end, SST-2, $d=64$, $\theta_w=0.8$, $\lambda_{\rm reg}=0.05$, 10 epochs). Left: training loss and validation accuracy over 10 epochs; final validation accuracy $69.4\%$. Centre: residual content $\lambda_{\max}(A_{\rm res})$ at the five growth events, strictly decreasing ($4.36 \to 2.38 \to 2.15 \to 1.35 \to 0.84$), confirming the spectral-ordering prediction (a sketch of this growth check follows the figure list). Right: separation force $F_{\rm sep}^{(h)}$ (bars) and prototype spread $\mathcal{S}(P^{(h)})$ (green curve) per head; $\mathcal{S}(P^{(h)}) > 0$ for all heads confirms assumption (A1); $F_{\rm sep}$ is only partially ordered, with deviations attributable to end-to-end co-adaptation after growth events.
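
The growth events tracked in Figure 3 reflect the rule stated in the abstract: a head is added only while the directional information left uncaptured by the existing heads is large enough. Below is a minimal sketch of such a check, assuming the uncaptured content is measured as $\lambda_{\max}(A_{\rm res})$, the top eigenvalue of the token covariance after projecting out the subspace already spanned by the existing heads, and compared with a growth threshold; the projection-based residual, the function names, and the threshold value are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def residual_top_eigenvalue(X: np.ndarray, head_directions: np.ndarray) -> float:
    """lambda_max of the token covariance after projecting out the span of the
    existing heads' directions (the projection and the eigenvalue criterion are
    assumptions of this sketch, not the paper's exact A_res)."""
    if head_directions.size:
        Q, _ = np.linalg.qr(head_directions.T)   # orthonormal basis of the captured span
        X = X - (X @ Q) @ Q.T                    # remove the captured component
    cov = (X.T @ X) / len(X)
    return float(np.linalg.eigvalsh(cov)[-1])    # eigvalsh is ascending; take the top

def should_grow(X: np.ndarray, head_directions: np.ndarray, threshold: float) -> bool:
    # INCRT-style rule: add a new head only while uncaptured content exceeds the threshold.
    return residual_top_eigenvalue(X, head_directions) > threshold

# Synthetic tokens with two dominant directions (variances roughly 16 and 4).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8)) @ np.diag([4.0, 2.0, 1, 1, 1, 1, 1, 1])
print(should_grow(X, np.empty((0, 8)), threshold=5.0))  # True: nothing captured, lambda_max ~ 16
print(should_grow(X, np.eye(8)[:1],    threshold=5.0))  # False: residual lambda_max ~ 4 after head 1
```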

Theorems & Definitions (105)

  • Lemma 1: Structure of $\Sigma_q$
  • proof
  • Lemma 2: First-order variation of $\Sigma_q$
  • proof
  • Lemma 3: First-order variation of assignment weights
  • proof
  • Lemma 4: Sign of the residual-covariance term
  • proof
  • Theorem 1: Monotonicity of $F_{\mathrm{sep}}$ under INCRT expansion
  • proof: first-order argument
  • ...and 95 more