HyperMLP: An Integrated Perspective for Sequence Modeling

Jiecheng Lu, Shihao Yang

TL;DR

HyperMLP reframes autoregressive attention as a dynamic two-layer MLP whose context-instantiated weights yield an ever-growing hidden width of size $t$ and a memory pool that gates over the prefix. It introduces HyperMLP and HyperGLU with input-conditioned mixing in both feature and sequence spaces using a lag layout, supported by a three-stage memory model: Global Memory Space → Context-instantiated Memory Pool → Current-step Activated Memory. The work provides formal analyses of three-stage memory, warped routing, and extension consistency, and shows empirical wins over strong softmax baselines under matched budgets across language modeling and memory-centric benchmarks. It demonstrates a pathway to richer expressivity in sequence models while maintaining practical autoregressive costs.

Abstract

Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
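The "attention head as a dynamic two-layer MLP" reading can be checked numerically. The sketch below is illustrative only and is not the paper's HyperMLP parameterization. It assumes a single textbook head with projections `Wq`, `Wk`, `Wv`, `Wo` (hypothetical names), no positional encoding, and the usual $1/\sqrt{d_k}$ scaling. It folds the context $X_{1:t}$ into a first-layer weight `W1 = X Wk Wq^T / sqrt(d_k)` of shape $(t, d)$ and a second-layer weight `W2 = (X Wv Wo)^T` of shape $(d, t)$, so the hidden width equals the context length $t$. Passing `relu` instead of `softmax` as the activation turns the normalized scores into unnormalized gates over the same slots.

```python
import math

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def relu(z):
    return [max(0.0, v) for v in z]

def attention_head(X, x_t, Wq, Wk, Wv, Wo, activation=softmax):
    """Textbook single-head attention for the current token x_t over the prefix X."""
    d_k = len(Wq[0])
    q = matvec(transpose(Wq), x_t)                        # (d_k,)
    K = matmul(X, Wk)                                     # (t, d_k)
    V = matmul(X, Wv)                                     # (t, d_v)
    scores = [s / math.sqrt(d_k) for s in matvec(K, q)]   # (t,)
    weights = activation(scores)                          # one weight per context slot
    ctx = matvec(transpose(V), weights)                   # (d_v,)
    return matvec(transpose(Wo), ctx)                     # (d,)

def dynamic_mlp_head(X, x_t, Wq, Wk, Wv, Wo, activation=softmax):
    """Same computation, written as a two-layer MLP whose weights come from X."""
    d_k = len(Wq[0])
    scale = math.sqrt(d_k)
    # First layer, instantiated from the context: shape (t, d).
    W1 = [[v / scale for v in row] for row in matmul(matmul(X, Wk), transpose(Wq))]
    # Second layer, instantiated from the context: shape (d, t).
    W2 = transpose(matmul(matmul(X, Wv), Wo))
    hidden = activation(matvec(W1, x_t))  # hidden width equals context length t
    return matvec(W2, hidden)

# Toy data (arbitrary values): t = 4 context tokens, d = 3, d_k = d_v = 2.
X = [[0.1, -0.2, 0.3], [0.5, 0.4, -0.1], [-0.3, 0.2, 0.6], [0.0, 0.7, -0.4]]
x_t = X[-1]  # the current token is the last row of the prefix
Wq = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
Wk = [[0.3, 0.2], [-0.1, 0.5], [0.4, -0.2]]
Wv = [[0.1, 0.6], [0.2, -0.3], [-0.4, 0.1]]
Wo = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]

for act in (softmax, relu):
    a = attention_head(X, x_t, Wq, Wk, Wv, Wo, act)
    b = dynamic_mlp_head(X, x_t, Wq, Wk, Wv, Wo, act)
    print(act.__name__, max(abs(u - v) for u, v in zip(a, b)))
```

The two functions agree up to floating-point rounding for either activation. Appending a row to `X` adds one row to `W1` and one column to `W2`, which is the "ever-growing hidden representation" described in the abstract.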

Paper Structure

This paper contains 145 sections, 49 theorems, 220 equations, 5 figures, and 5 tables.

Key Result

Theorem 2.1

Assume $\operatorname{L2Norm}_t(z)=z/\rho_t(z)$, where $\rho_t(z)>0$ is a scalar. The head in eq:hypermlp_ht--eq:hypermlp_w2 satisfies: (i) Dynamic two-layer MLP. (ii) Pool $\to$ activated readout, with $(u_i,v_i)$ and $(\mu^{\mathrm{pool}}_{X,x_t},\mu^{\mathrm{act}}_{X,x_t})$ defined above; see the integral notation in eq:mech-alpha-beta. The derivation is in thm:three-stage-memory-multistage-pruni

Figures (5)

  • Figure 1: The integrated attention-as-MLP view: dynamic two-layer MLPs, memory instantiation, and HyperMLP.
  • Figure 2: Attention as a dynamic two-layer MLP: many attention variants can be viewed as edits in the same backbone: $\sigma$ (normalization/gating), QK feature mixing (sharing/compression/structured cores such as RoPE), sequence-axis mixing on $X_{1:t}$ (e.g., convolution/pooling/low-rank mixing), and VO readout (merging/compression/gating).
  • Figure 3: The integrated attention-as-MLP view: dynamic MLP heads, three-stage memory, lag layout, and DPLR sequence mixing.
  • Figure 4: Logical map of our theoretical results.
  • Figure 5: Detailed training loss for the NanoGPT experiment settings. Each "training step" on the axis is an evaluation step, and each evaluation step spans 500 training iterations.

Theorems & Definitions (108)

  • Theorem 2.1: Dynamic-head decomposition: MLP form and context-wide slots
  • proof : Proof sketch
  • Corollary 2.2: Warped routing strictly generalizes polyhedral routing
  • proof : Proof sketch
  • Theorem 2.3: Lag layout: extension consistency implies AR truncation invariance
  • proof : Proof sketch
  • Proposition 2.4: HyperGLU decouples routing and magnitude
  • proof : Proof sketch
  • Theorem 2.5: Budget asymmetry in residual two-layer blocks
  • proof : Proof sketch
  • ...and 98 more