Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures

Joshua Nunley

TL;DR

Using a minimal axiomatic setup, we derive recurrent and transformer templates from a shared skeleton in which the subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We also report a general linear-mixing extension in tangent space.

Abstract

This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d). We use a minimal axiomatic setup and derive recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We then specialize to O(d) and evaluate orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. We also report a general linear-mixing extension in tangent space, which applies across subgroup choices and improves finite-budget performance in the current O(d) experiments.
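
To make the three drop-in components concrete, the following is a minimal sketch for the O(d) case, not the paper's implementation. It assumes the tangent projection is the skew-symmetric part of a matrix and uses the Cayley transform as a stand-in update map; the function names `project_to_tangent`, `cayley`, and `update_state` are hypothetical, and the paper's actual update equations and approximations may differ.

```python
import numpy as np

def project_to_tangent(m):
    """Project a square matrix onto the skew-symmetric matrices,
    i.e. the Lie algebra (tangent space at the identity) of O(d)."""
    return 0.5 * (m - m.T)

def cayley(a):
    """Cayley transform (I - a/2)^{-1} (I + a/2).
    For skew-symmetric `a` the result is an orthogonal matrix.
    Assumption: used here in place of whatever update map the paper adopts."""
    eye = np.eye(a.shape[0])
    return np.linalg.solve(eye - 0.5 * a, eye + 0.5 * a)

def update_state(h, m):
    """One hypothetical recurrent step: the state stays on O(d) because
    it is right-multiplied by an orthogonal increment."""
    return h @ cayley(project_to_tangent(m))

# Drive the state with random inputs and check that it remains orthogonal.
rng = np.random.default_rng(0)
d = 4
h = np.eye(d)
for _ in range(10):
    h = update_state(h, rng.normal(size=(d, d)))

orthogonality_error = np.linalg.norm(h.T @ h - np.eye(d))
```

Under this reading, changing the subgroup would mean replacing only the projection and the update map, while the surrounding recurrence stays the same.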

Paper Structure (31 sections, 29 equations, 3 figures, 8 tables)

Introduction
Related Work and Positioning
Orthogonal and unitary RNNs.
Manifold-valued states and group-aware attention.
This work.
Framework and Axioms
General RNN and Transformer Templates
RNN template
Transformer template
Optional tangent mixing component
Subgroup Instantiations as Drop-In Components
Scope of experiments
Experiments on O(d) Models
Evaluated equations and chosen approximations
Parameter-matched comparisons
...and 16 more sections

Figures (3)

Figure 1: Tiny Shakespeare scaling curve corresponding to Table \ref{tab:ts_scaling_summary}.
Figure 2: Penn Treebank scaling curve corresponding to Table \ref{tab:ptb_scaling_summary}.
Figure 3: Optimizer robustness summary on Tiny Shakespeare at 500K parameters (2-layer and 4-layer settings).
