Feature emergence via margin maximization: case studies in algebraic tasks

Depen Morwani; Benjamin L. Edelman; Costin-Andrei Oncescu; Rosie Zhao; Sham Kakade

TL;DR

This paper proves that trained networks use Fourier features to perform modular addition and use features corresponding to irreducible group-theoretic representations to perform composition in general groups, aligning closely with the empirical observations of Nanda et al. and Chughtai et al.

Abstract

Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question -- why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform compositions in general groups, aligning closely with the empirical observations of Nanda et al. and Chughtai et al. More generally, we hope our techniques can help to foster a deeper understanding of why neural networks adopt specific computational strategies.

Paper Structure (39 sections, 29 theorems, 113 equations, 9 figures, 1 table)

Introduction
Nanda et al.'s striking observations.
Our Contributions.
Preliminaries
Theoretical Approach
Binary Classification
Multi-Class Classification
Blueprint for the case studies
Cyclic groups (modular addition)
Sparse parity
Finite Groups with Real Representations
Brief Background and Notation
The Main Result
Discussion
Acknowledgments
...and 24 more sections

Key Result

Theorem 1

For any norm $\| \cdot \|$, a fixed $r > 0$ and any homogeneous function $f$ with homogeneity constant $\nu > 0$, if $\gamma^* > 0$, then $\lim_{\lambda\to 0}\gamma_\lambda=\gamma^*$.
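
The excerpt states the theorem without defining $\gamma_\lambda$ and $\gamma^*$. The following is a sketch of the setup behind the cited result of Wei et al., assuming the paper follows it. The symbols $\mathcal{L}_\lambda$, $\theta_\lambda$, $h$, $D$ and $\ell$ do not appear in the excerpt and are introduced here for illustration only; $\ell$ stands for the cross-entropy loss, and homogeneity means $f(c\theta, x) = c^{\nu} f(\theta, x)$ for all $c > 0$.

```latex
\begin{align*}
\mathcal{L}_\lambda(\theta) &= \frac{1}{|D|}\sum_{(x,y)\in D} \ell\bigl(f(\theta,x),\,y\bigr) + \lambda\,\|\theta\|^{r},
  & \theta_\lambda &\in \arg\min_\theta \mathcal{L}_\lambda(\theta),\\
h(\theta) &= \min_{(x,y)\in D}\Bigl(f(\theta,x)[y] - \max_{y'\neq y} f(\theta,x)[y']\Bigr),\\
\gamma_\lambda &= h\!\left(\theta_\lambda/\|\theta_\lambda\|\right),
  & \gamma^{*} &= \max_{\|\theta\|\le 1} h(\theta).
\end{align*}
```

Read this way, the theorem says that as the regularization strength $\lambda$ vanishes, the normalized margin of the regularized minimizer converges to the largest margin achievable on the unit ball of the chosen norm.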

Figures (9)

Figure 1: (a) Final trained embeddings and their Fourier power spectrum for a 1-hidden layer ReLU network trained on a mod-71 addition dataset with $L_2$ regularization. Each row corresponds to an arbitrary neuron from the trained network. The red dots represent the actual value of the weights, while the light blue interpolation is obtained by finding the function over the reals with the same Fourier spectrum as the weight vector. (b) Similar plot for a 1-hidden layer network with quadratic activation, trained with $L_{2,3}$ regularization (see the Preliminaries section). (c) For the quadratic activation, the network asymptotically reaches the maximum $L_{2,3}$ margin predicted by our analysis.
Figure 2: An illustration of an individual neuron $\phi(\{u, v, w\}, a, b)$ (left) and the resulting one hidden layer neural network $f(\theta, a, b)$ (right) with quadratic activations.
Figure 3: A schematic illustration of the relation between class-weighted margin $g'$ and maximum margin $g$.
Figure 4: The maximum normalized power of the embedding vector of a neuron is given by $\max_i |\hat{u}[i]|^2/(\sum_j |\hat{u}[j]|^2)$, where $\hat{u}[i]$ represents the $i^{th}$ component of the Fourier transform of $u$. (a) Initially, the maximum power is randomly distributed. (b) For a 1-hidden layer ReLU network trained with $L_2$ regularization, the final distribution of maximum power is concentrated around 0.9, meaning neurons are nearly, but not exactly, 1-sparse in frequency space. (c) For a 1-hidden layer quadratic network trained with $L_{2,3}$ regularization, the final maximum power is almost exactly 1 for all the neurons, so the embeddings are 1-sparse in frequency space, as predicted by the maximum margin analysis.
Figure 5: Final neurons with highest norm and the evolution of the normalized $L_{2,5}$ margin over training of a 1-hidden layer quartic network (activation $x^4$) on the $(10,4)$ sparse parity dataset with $L_{2,5}$ regularization. The network approaches the theoretical maximum margin that we predict.
...and 4 more figures
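
The maximum normalized power from Figure 4 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `max_normalized_power` is hypothetical, and it assumes that the two conjugate DFT components of a real vector (frequencies $k$ and $p-k$) are counted as a single frequency, so that a pure cosine registers as 1-sparse.

```python
import numpy as np

def max_normalized_power(u):
    """Largest per-frequency share of the Fourier power of a real vector u.

    Assumption (not stated in the source): conjugate DFT components k and
    n - k are merged into one frequency, so a single cosine scores about 1.
    """
    u = np.asarray(u, dtype=float)
    n = u.shape[0]
    # rfft keeps only the non-negative frequencies of a real signal.
    power = np.abs(np.fft.rfft(u)) ** 2
    # Every frequency except DC (and Nyquist, when n is even) has a
    # conjugate partner with equal power, so it counts twice.
    weights = np.full(power.shape, 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0
    power = power * weights
    total = power.sum()
    if total == 0.0:
        raise ValueError("the zero vector has no defined power spectrum")
    return float(power.max() / total)

# Hypothetical embeddings over Z_71, matching the modulus in Figure 1.
p = 71
a = np.arange(p)
single = np.cos(2 * np.pi * 5 * a / p + 0.3)      # one frequency
mixed = single + np.cos(2 * np.pi * 12 * a / p)   # two equal-power frequencies

print(max_normalized_power(single))  # approximately 1.0
print(max_normalized_power(mixed))   # approximately 0.5
```

A value near 1 corresponds to the 1-sparse embeddings reported for the quadratic network in panel (c), while values around 0.9 correspond to the ReLU case in panel (b).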

Theorems & Definitions (50)

Theorem 1 (Wei et al., 2019, Theorem 4.1)
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Theorem 7
Proof outline
Theorem 8
Theorem 9
...and 40 more
