Modular Duality in Deep Learning

Jeremy Bernstein, Laker Newhouse

TL;DR

Modular dualization is a principled, recursive framework for building duality maps for general neural architectures. It assigns an operator norm to each layer, derives duality maps for the atomic layers (Linear, Embed, Conv2D), and propagates them through composition and concatenation via a modular norm. The result is a GPU-friendly update rule covering all layers, intended as a common basis for fast and scalable training algorithms; prior methods such as μP and Shampoo appear as partial realizations of a single duality map based on the RMS→RMS operator norm. The paper also provides practical tools, including sketching and rectangular Newton-Schulz iterations, reports that a variant of its methods set speed records for training NanoGPT, and outlines a broader vision involving a type-system perspective and activation–update alignment.

Abstract

An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers -- the latter two methods are based on a rectangular Newton-Schulz iteration (Kovarik, 1970; Björck & Bowie, 1971). A variant of our methods was used to set speed records for training NanoGPT. Overall, we hope that our theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.
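The rectangular Newton-Schulz iteration mentioned in the abstract can be illustrated with a minimal NumPy sketch. It uses the textbook cubic iteration $X \leftarrow \tfrac{3}{2} X - \tfrac{1}{2} X X^\top X$, which may differ from the coefficients and normalization the paper actually uses. The function names `newton_schulz_orthogonalize` and `dualize_linear`, the step count, and the Frobenius pre-scaling are assumptions made for illustration. The $\sqrt{d_\mathrm{out}/d_\mathrm{in}}$ factor assumes the RMS→RMS operator norm is the one assigned to Linear layers.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate U @ V.T, where G = U @ diag(s) @ V.T is the reduced SVD.

    Sketch only: assumes G is nonzero and has full rank.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # inside the convergence region (0, sqrt(3)) of the cubic iteration.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Acts on each singular value s as s -> 1.5*s - 0.5*s**3,
        # which drives s toward 1 while leaving U and V unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def dualize_linear(G, steps=30):
    """Hypothetical helper: RMS->RMS dualization of a (d_out, d_in) gradient."""
    d_out, d_in = G.shape
    return np.sqrt(d_out / d_in) * newton_schulz_orthogonalize(G, steps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 6))        # a rectangular "gradient"
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    err = np.max(np.abs(newton_schulz_orthogonalize(G) - U @ Vt))
    print("max deviation from U @ V.T:", err)
```

The iteration needs only matrix multiplications, which is why it suits GPUs better than an explicit SVD; the SVD appears here solely as a reference to compare against.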

Paper Structure

This paper contains 22 sections, 1 theorem, 14 equations, 1 table.

Key Result

Proposition 1

For any ${\bm{g}} \in \mathbb{R}^n$ thought of as "the gradient", any $\lambda \geq 0$ thought of as "the sharpness", and any norm $\Vert {\cdot} \Vert:\mathbb{R}^n\to\mathbb{R}$ with dual norm $\Vert {\cdot} \Vert^\dagger$ and duality map $\operatorname{dualize}_{\Vert {\cdot} \Vert}$:
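For orientation, the standard steepest-descent identity that matches this setup is written out here. It assumes the usual definitions $\Vert {\bm{g}} \Vert^\dagger = \max_{\Vert {\bm{t}} \Vert = 1} {\bm{g}}^\top {\bm{t}}$ and $\operatorname{dualize}_{\Vert {\cdot} \Vert} {\bm{g}} = \arg\max_{\Vert {\bm{t}} \Vert = 1} {\bm{g}}^\top {\bm{t}}$, takes $\lambda > 0$ so that the right-hand side is finite, and uses $\Delta {\bm{w}}$ as an assumed symbol for the weight update:

$$
\operatorname*{arg\,min}_{\Delta {\bm{w}} \in \mathbb{R}^n} \left[ {\bm{g}}^\top \Delta {\bm{w}} + \frac{\lambda}{2} \Vert \Delta {\bm{w}} \Vert^2 \right] = - \frac{\Vert {\bm{g}} \Vert^\dagger}{\lambda} \, \operatorname{dualize}_{\Vert {\cdot} \Vert} {\bm{g}}.
$$

A short check: writing $\Delta {\bm{w}} = c\,{\bm{t}}$ with $\Vert {\bm{t}} \Vert = 1$ and $c \geq 0$, the linear term is minimized by ${\bm{t}} = -\operatorname{dualize}_{\Vert {\cdot} \Vert} {\bm{g}}$, where it equals $-c \Vert {\bm{g}} \Vert^\dagger$. Minimizing $-c \Vert {\bm{g}} \Vert^\dagger + \tfrac{\lambda}{2} c^2$ over $c$ then gives $c = \Vert {\bm{g}} \Vert^\dagger / \lambda$.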

Theorems & Definitions (15)

  • Definition 1: Dual norm
  • Definition 2: Duality map based on a norm
  • Proposition 1: Steepest descent under a norm
  • Example 1: Duality map for the Euclidean norm
  • Example 2: Duality map for the infinity norm
  • Definition 3: Induced operator norm
  • Example 3: Duality map for the $\mathrm{RMS} \to \mathrm{RMS}$ operator norm
  • Example 4: Duality map for the $\ell_1 \to \mathrm{RMS}$ operator norm
  • Definition 4: Module
  • Definition 5: Well-normed module
  • ...and 5 more
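
Examples 1 to 4 in the list above name duality maps for four norms. The NumPy sketch below shows what those maps look like under the usual definition $\operatorname{dualize}\,{\bm{g}} = \arg\max_{\Vert {\bm{t}} \Vert = 1} {\bm{g}}^\top {\bm{t}}$. The closed forms are the standard ones for each norm and are stated here as assumptions, since this listing does not reproduce the examples themselves; all function names are hypothetical. Matrices are taken to have shape `(d_out, d_in)`.

```python
import numpy as np

def dualize_euclidean(g):
    # Euclidean norm: the maximizer is the unit vector along g (g must be nonzero).
    return g / np.linalg.norm(g)

def dualize_infinity(g):
    # Infinity norm: the maximizer takes the sign of each entry.
    return np.sign(g)

def dualize_rms_to_rms(G):
    # RMS->RMS operator norm: set all singular values of G to sqrt(d_out / d_in).
    d_out, d_in = G.shape
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return np.sqrt(d_out / d_in) * U @ Vt

def dualize_l1_to_rms(G):
    # l1->RMS operator norm: rescale each column to unit RMS norm
    # (every column must be nonzero).
    col_rms = np.sqrt(np.mean(G**2, axis=0, keepdims=True))
    return G / col_rms

if __name__ == "__main__":
    g = np.array([3.0, -4.0])
    print(dualize_euclidean(g), dualize_infinity(g))
    G = np.random.default_rng(1).standard_normal((4, 6))
    print(np.linalg.svd(dualize_rms_to_rms(G), compute_uv=False))
    print(np.sqrt(np.mean(dualize_l1_to_rms(G) ** 2, axis=0)))
```

The SVD in `dualize_rms_to_rms` is used for clarity only; the abstract indicates that the paper replaces it with a rectangular Newton-Schulz iteration for GPU efficiency.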