HyperMLP: An Integrated Perspective for Sequence Modeling

Jiecheng Lu; Shihao Yang

TL;DR

HyperMLP reframes autoregressive attention as a dynamic two-layer MLP whose context-instantiated weights yield an ever-growing hidden width of size $t$ and a memory pool that gates over the prefix. It introduces HyperMLP and HyperGLU with input-conditioned mixing in both feature and sequence spaces using a lag layout, supported by a three-stage memory model: Global Memory Space → Context-instantiated Memory Pool → Current-step Activated Memory. The work provides formal analyses of three-stage memory, warped routing, and extension consistency, and shows empirical wins over strong softmax baselines under matched budgets across language modeling and memory-centric benchmarks. It demonstrates a pathway to richer expressivity in sequence models while maintaining practical autoregressive costs.

Abstract

Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
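The attention-as-MLP view described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, shapes, and random weights are assumptions made for illustration, and the output projection, any normalization such as L2Norm, and HyperMLP's learned sequence mixing are all omitted. The sketch only checks that a single softmax attention head equals a two-layer MLP whose weights are built from the prefix, and that swapping softmax for ReLU turns the hidden units into non-negative gates over prefix positions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

def attention_head(x_t, X, Wq, Wk, Wv):
    """Standard single-query softmax attention over the prefix X (t rows)."""
    q = x_t @ Wq                      # (d_k,)
    K = X @ Wk                        # (t, d_k)
    V = X @ Wv                        # (t, d_k)
    scores = q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def dynamic_mlp_head(x_t, X, Wq, Wk, Wv, activation):
    """Same head written as a two-layer MLP with context-built weights.

    W1 and W2 are hypothetical names for the two instantiated layers.
    """
    d_k = Wk.shape[-1]
    W1 = (Wq @ Wk.T @ X.T) / np.sqrt(d_k)   # (d, t): first layer, from the prefix
    W2 = X @ Wv                             # (t, d_k): second layer, from the prefix
    hidden = activation(x_t @ W1)           # (t,): hidden width equals prefix length
    return hidden @ W2, hidden

rng = np.random.default_rng(0)
d, d_k, t = 8, 4, 5                         # illustrative sizes only
X = rng.normal(size=(t, d))                 # prefix x_1..x_t as rows
x_t = X[-1]                                 # current token
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))

out_attn = attention_head(x_t, X, Wq, Wk, Wv)
out_mlp, hidden = dynamic_mlp_head(x_t, X, Wq, Wk, Wv, softmax)
out_relu, hidden_relu = dynamic_mlp_head(x_t, X, Wq, Wk, Wv, relu)
```

With `softmax` as the activation, `out_mlp` matches `out_attn` up to floating-point error. With `relu`, `hidden_relu` is a vector of non-negative, unnormalized activations, one per prefix position, so it selects positions without forming a probability distribution.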

Paper Structure (145 sections, 49 theorems, 220 equations, 5 figures, 5 tables)

Introduction
Redesigning Attention from First Principles
ReLU MLPs select input-conditioned sub-networks
Attention as a dynamic two-layer MLP
Limitation of the classical form: a fixed positional basis in the hidden space
HyperMLP: learning sequence mixing effectively
The Attention-as-MLP View: Pool instantiation, routing geometry, and autoregressive consistency
Parameter budgeting and efficient implementation
Empirical Evaluation
3.a. Controlled Design Study
Conclusion and Limitations
Overview of the Supporting Theoretical Results in the Appendix
Notation and Common Setup
Row-vector convention and indexing.
Lag (reverse-offset) layout.
...and 130 more sections

Key Result

Theorem 2.1

Assume $\operatorname{L2Norm}_t(z)=z/\rho_t(z)$, where $\rho_t(z)>0$ is a scalar. The head in eq:hypermlp_ht--eq:hypermlp_w2 satisfies: (i) Dynamic two-layer MLP. (ii) Pool $\to$ activated readout, with $(u_i,v_i)$ and $(\mu^{\mathrm{pool}}_{X,x_t},\mu^{\mathrm{act}}_{X,x_t})$ defined above. See the integral notation in eq:mech-alpha-beta. The derivation is in thm:three-stage-memory-multistage-pruni

Figures (5)

Figure 1: The integrated attention-as-MLP view: dynamic two-layer MLPs, memory instantiation, and HyperMLP.
Figure 2: Attention as a dynamic two-layer MLP: many attention variants can be viewed as edits in the same backbone: $\sigma$ (normalization/gating), QK feature mixing (sharing/compression/structured cores such as RoPE), sequence-axis mixing on $X_{1:t}$ (e.g., convolution/pooling/low-rank mixing), and VO readout (merging/compression/gating).
Figure 3: The integrated attention-as-MLP view: Dynamic MLP Heads, 3-stage memory, Lag Layout, and DPLR sequence Mixing.
Figure 4: Logical map of our theoretical results.
Figure 5: Detailed training loss for the NanoGPT experiment settings. Each "training step" shown corresponds to one evaluation step, and each evaluation step spans 500 training iterations.

Theorems & Definitions (108)

Theorem 2.1: Dynamic-head decomposition: MLP form and context-wide slots
proof : Proof sketch
Corollary 2.2: Warped routing strictly generalizes polyhedral routing
proof : Proof sketch
Theorem 2.3: Lag layout: extension consistency implies AR truncation invariance
proof : Proof sketch
Proposition 2.4: HyperGLU decouples routing and magnitude
proof : Proof sketch
Theorem 2.5: Budget asymmetry in residual two-layer blocks
proof : Proof sketch
...and 98 more
