Scalable Optimization in the Modular Norm

Tim Large; Yang Liu; Minyoung Huh; Hyojin Bahng; Phillip Isola; Jeremy Bernstein

TL;DR

This paper introduces the modular norm, an architecture-aware norm defined recursively on the full weight space of a neural network. Normalizing the updates of any base optimizer in this norm makes the learning rate transferable across width and depth. The authors also prove that, for networks built from well-behaved atomic modules, the gradient is Lipschitz-continuous in the modular norm, with a Lipschitz constant given by a simple recursive formula. They release Modula, a Python package that constructs the modular norm of an architecture and normalizes weight updates in it, and report improved scaling on GPT-style transformers and on vision architectures (ResMLPs and ResNets on CIFAR-10). The sharpness and smoothness results connect practical optimization to standard optimization theory, and the mass-allocation experiments invite further study of normed optimization.

Abstract

To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the "natural norm" particular to that layer. In this paper, we significantly generalize this idea by defining the modular norm, which is the natural norm on the full weight space of any neural network architecture. The modular norm is defined recursively in tandem with the network architecture itself. We show that the modular norm has several promising applications. On the practical side, the modular norm can be used to normalize the updates of any base optimizer so that the learning rate becomes transferable across width and depth. This means that the user does not need to compute optimizer-specific scale factors in order to scale training. On the theoretical side, we show that for any neural network built from "well-behaved" atomic modules, the gradient of the network is Lipschitz-continuous in the modular norm, with the Lipschitz constant admitting a simple recursive formula. This characterization opens the door to porting standard ideas in optimization theory over to deep learning. We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via "pip install modula" with source code at https://github.com/jxbz/modula.
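As a rough illustration of what normalizing a base optimizer's update in a layer's natural norm can look like, the sketch below handles a single linear layer. It assumes the natural norm of a d_out x d_in weight matrix is the RMS-to-RMS operator norm, sqrt(d_in / d_out) times the spectral norm. That choice and the function names `rms_operator_norm` and `normalized_step` are assumptions made for this example; they are not taken from the Modula package, whose API is not shown here.

```python
import numpy as np

def rms_operator_norm(w):
    """Assumed natural norm of a d_out x d_in matrix:
    sqrt(d_in / d_out) times the largest singular value."""
    d_out, d_in = w.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(w, ord=2)

def normalized_step(w, raw_update, lr, eps=1e-12):
    """Rescale a base-optimizer update to unit norm, then step by lr."""
    scale = rms_operator_norm(raw_update) + eps
    return w - lr * raw_update / scale

rng = np.random.default_rng(0)
for width in (32, 128, 512):
    w = rng.standard_normal((width, width)) / np.sqrt(width)
    raw = rng.standard_normal((width, width))  # stand-in for an SGD or Adam update
    step = normalized_step(w, raw, lr=0.1) - w
    # The size of the step in the assumed norm is set by lr, not by width.
    print(width, rms_operator_norm(step))
```

Under these assumptions the step has norm close to `lr` at every width, which is the sense in which a single learning rate can be reused as the layer grows. The full modular norm combines such per-layer norms across the whole architecture.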

Paper Structure

This paper contains 44 sections, 9 theorems, 140 equations, 12 figures, 2 tables.

Introduction
Related work
Metrization
Asymptotics
Majorization
Descent in Normed Spaces
What's in a norm?
Preview of the modular norm
Normed optimization
Constructing the Modular Norm
Modules
Compound modules: Building new modules from old
Mass allocation in compound modules
Smoothness in the modular norm
Experiments
...and 29 more sections

Key Result

Proposition 1

If modules $\mathsf{M}_1, \mathsf{M}_2, \mathsf{M}_3$ are successively composable, then $\mathsf{M}_3 \circ (\mathsf{M}_2 \circ \mathsf{M}_1)$ equals $(\mathsf{M}_3 \circ \mathsf{M}_2) \circ \mathsf{M}_1$ in all attributes. If modules $\mathsf{M}_1, \mathsf{M}_2, \mathsf{M}_3$ are mutually concatena
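The associativity of composition can be checked numerically on a toy example. The sketch below assumes a particular set of composition rules: masses add, sensitivities multiply, and the norm of the composite is a mass-weighted maximum of the child norms. These rules, the `Module` dataclass, and the scalar-weight `atom` helper are assumptions made for illustration and should be checked against the paper's Definition 3; they are not the Modula API.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Module:
    mass: float
    sensitivity: float
    norm: Callable  # maps a (possibly nested) weight tuple to a float

def atom(mass, sensitivity):
    """Hypothetical atomic module with one scalar weight, normed by |w|."""
    return Module(mass, sensitivity, lambda w: abs(w))

def compose(m2, m1):
    """m2 after m1 under the assumed rules; the weights are the pair (w1, w2)."""
    mass = m1.mass + m2.mass

    def norm(w):
        w1, w2 = w
        return max(
            m2.sensitivity * mass / m1.mass * m1.norm(w1),
            mass / m2.mass * m2.norm(w2),
        )

    return Module(mass, m1.sensitivity * m2.sensitivity, norm)

m1, m2, m3 = atom(1.0, 2.0), atom(2.0, 0.5), atom(3.0, 1.5)
left = compose(m3, compose(m2, m1))   # M3 o (M2 o M1), weights ((w1, w2), w3)
right = compose(compose(m3, m2), m1)  # (M3 o M2) o M1, weights (w1, (w2, w3))

w1, w2, w3 = 0.3, -1.2, 0.7
print(math.isclose(left.mass, right.mass))
print(math.isclose(left.sensitivity, right.sensitivity))
print(math.isclose(left.norm(((w1, w2), w3)), right.norm((w1, (w2, w3)))))
```

The two groupings nest the weight tuple differently, so the comparison re-nests the same three weights for each side. Masses must be positive here, since the assumed norm divides by each child's mass.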

Figures (12)

Figure 1: Learning rate transfer in the modular norm. We train GPT with context length 128 for 10k steps on OpenWebText. Left: Learning rate sweeps for normed Adam (Adam with updates normalized in the modular norm) with three transformer blocks and varying width. The optimal learning rate (marked by red dots) transfers well across scales. Mid-left: The same, but varying the number of blocks at width 128. Mid-right: Comparing normed versus unnormed Adam and SGD at fixed learning rate and varying width. For each method, we tune the learning rate at the scale marked by the dotted line. The normed methods scale better. Right: The same, but scaling number of blocks.
Figure 2: Modules and trees of modules. A module is an object that maps an input and a weight vector to an output. Left: In addition to the standard forward function, our modules are endowed with two numbers (a mass and a sensitivity) and a norm. Middle: New compound modules are built via the binary operations of composition and concatenation. We provide rules for composing and concatenating all module attributes. Right: Compound modules are binary trees, where the leaves are modules and the internal nodes compose and concatenate their children. Here we illustrate a sum of modules, which leverages a special utility module $\mathsf{Add}$; see \ref{tab:operations} for more on this.
Figure 3: Exploring mass allocation. We tune the total mass of the hidden layers, training with normed Adam. Left group: Learning rate sweeps for ResMLP on CIFAR-10, for varying depth and mass. The bottom right subplot reports the best train loss at each mass and depth. Mass 0.5 is best at all depths. Right group: Learning rate sweeps for GPT on OpenWebText, for varying mass. Both optimal mass and learning rate transferred from the small model (top) to the large model (bottom).
Figure 4: Learning rate transfer on CIFAR-10. We tune the learning rate on a small model (at the scale marked by the dotted line) and test the performance on models of increasing width and depth at this fixed learning rate. We find that normed Adam and SGD scale better than their unnormed counterparts on both ResMLPs and ResNets. See Figure 1 for the same experiment on GPT.
Figure 5: Comparing to a standard transformer implementation. Since we used our own well-normed GPT implementation for the experiments in this paper (here referred to as modulaGPT) we wanted to check its performance was on par with a standard nanoGPT implementation. These plots show learning rate sweeps for varying width and depth for Adam on nanoGPT, as well as Adam and normed Adam on modulaGPT. Even without normed updates, the architectural changes and orthogonal initialization used in Modula seem to already improve transfer compared to nanoGPT.
...and 7 more figures

Theorems & Definitions (18)

Definition 1: Module
Definition 2: Well-normed
Definition 3: Module composition
Definition 4: Module concatenation
Proposition 1: Composition and concatenation are associative
Proposition 2: Composition and concatenation preserve well-normedness
Proposition 3: Feature learning is apportioned by mass
Definition 5: Module sharpness
Proposition 4
Proposition 5: Loss functions are smooth in the modular norm
...and 8 more
