HyperMLP: An Integrated Perspective for Sequence Modeling
Jiecheng Lu, Shihao Yang
TL;DR
HyperMLP reframes an autoregressive attention head as a dynamic two-layer MLP whose weights are instantiated from the context, so that the hidden width equals the prefix length $t$ and the activation gates over a memory pool built from the prefix. The paper introduces HyperMLP and HyperGLU, which apply input-conditioned mixing in both feature space and sequence space using a reverse-offset (lag) layout, and it organizes the mechanism as a three-stage memory model: Global Memory Space → Context-instantiated Memory Pool → Current-step Activated Memory. The work provides formal analyses of the three-stage memory, warped routing, and extension consistency, and reports empirical gains over strong softmax-attention baselines under matched parameter budgets on language modeling and memory-centric benchmarks. The results suggest a path to richer expressivity in sequence models at practical autoregressive cost.
Abstract
Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
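The dynamic-MLP view described above can be checked numerically on a single head. The following is a minimal sketch, not the authors' HyperMLP or HyperGLU implementation: it omits the learned sequence-space mixing and the lag layout, and it drops the usual $1/\sqrt{d_k}$ scaling for brevity. The function names (`attention_step`, `dynamic_mlp_step`) and the weight shapes are assumptions chosen for illustration. It shows that the attention output at step $t$ equals a two-layer MLP applied to $x_t$, where the first-layer weights $W_1 = X W_k W_q^\top$ and second-layer weights $W_2 = X W_v$ are built from the prefix $X$, the hidden width is $t$, and the activation is a free choice (softmax recovers standard attention, ReLU gives a gated selection instead of a probability distribution).

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(z - z.max())
    return e / e.sum()


def relu(z):
    return np.maximum(z, 0.0)


def attention_step(X, W_q, W_k, W_v):
    """Standard single-head causal attention output at the last position.

    X is the prefix (t x d); the query comes from the last row x_t.
    Scaling by 1/sqrt(d_k) is omitted in this sketch.
    """
    q = X[-1] @ W_q          # (d_k,)
    K = X @ W_k              # (t, d_k)
    V = X @ W_v              # (t, d_v)
    return softmax(K @ q) @ V


def dynamic_mlp_step(X, W_q, W_k, W_v, activation):
    """Same computation written as a two-layer MLP applied to x_t.

    Both weight matrices are instantiated from the prefix X, so the
    hidden layer has one unit per prefix position (width t).
    """
    x_t = X[-1]
    W1 = X @ W_k @ W_q.T     # (t, d): context-instantiated first layer
    W2 = X @ W_v             # (t, d_v): context-instantiated second layer
    h = W1 @ x_t             # (t,): hidden pre-activations (the attention scores)
    return activation(h) @ W2


# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
t, d, d_k, d_v = 5, 8, 4, 6
X = rng.normal(size=(t, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

out_attention = attention_step(X, W_q, W_k, W_v)
out_mlp_softmax = dynamic_mlp_step(X, W_q, W_k, W_v, softmax)
out_mlp_relu = dynamic_mlp_step(X, W_q, W_k, W_v, relu)

print(np.allclose(out_attention, out_mlp_softmax))
```

Swapping `softmax` for `relu` changes only the hidden activation, which is the sense in which the abstract treats attention scores as a hidden representation rather than a normalized distribution.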
