Breaking Neural Network Scaling Laws with Modularity

Akhilan Boopathy; Sunshine Jiang; William Yue; Jaedong Hwang; Abhiram Iyer; Ila Fiete

TL;DR

This work addresses the challenge that neural network generalization scales unfavorably with task dimensionality by asking whether modular architectures can break the exponential sample complexity barrier. It offers a simple linear toy model that yields closed‑form expressions for training and test losses, showing that monolithic nets suffer dimension‑dependent scaling while modular structures with bottleneck inputs can render generalization independent of the task dimension $m$. Building on this theory, the authors propose a kernel‑based modular learning rule to align modules with the true task modules and validate it empirically on sine‑wave regression and Compositional CIFAR‑10, achieving better in‑ and out‑of‑distribution generalization than baselines. The results provide a principled explanation for modularity’s benefits and a practical method for recovering underlying task modules, highlighting potential impact for high‑dimensional, compositional problems while acknowledging modeling and optimization limitations that warrant further research.

Abstract

Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However, a theoretical explanation of how modularity improves generalizability, and of how to leverage task modularity while training networks, remains elusive. Using recent theoretical progress in explaining neural network generalization, we investigate how the amount of training data required to generalize on a task varies with the intrinsic dimensionality of the task's input. We show theoretically that, on modularly structured tasks, nonmodular networks require a number of samples that is exponential in task dimensionality, whereas modular networks' sample complexity is independent of task dimensionality: modular networks can generalize in high dimensions. We then develop a novel learning rule for modular networks to exploit this advantage and empirically show the improved generalization of the rule, both in- and out-of-distribution, on high-dimensional, modular tasks.

Paper Structure (41 sections, 2 theorems, 79 equations, 12 figures, 4 tables, 2 algorithms)

This paper contains 41 sections, 2 theorems, 79 equations, 12 figures, 4 tables, 2 algorithms.

Introduction
Related Work
Modular Neural Networks
Neural Network Scaling Laws
Modeling Neural Network Generalization
Model Setup
Theoretical Properties
Empirical Validation
Using Modularity to Generalize in High Dimensions
Sample Complexity of Modular Learning
Modular Learning Rule
Experimental Results
Modular NNs empirically generalize better in and out-of-distribution
Our learning rule finds the true task modules
Our learning rule extends to nonlinear module projections
...and 26 more sections

Key Result

Theorem 1

Given a target function $y$ and model $\hat{y}$ estimated as described above, in the limit $P \to \infty$, the expected test loss (averaged over $x$ and $W$) and the expected training loss each admit a closed-form expression in terms of $F(n, p) = \mathbb{E}\left[\left\|R^\dagger\right\|_F^2\right]$, where $R \in \mathbb{R}^{n \times p}$ has elements drawn i.i.d. from $\mathcal{N}(0, 1)$ and $R^\dagger$ denotes its pseudoinverse.
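To make the quantity $F(n, p)$ concrete, the following is a minimal sketch that estimates it by Monte Carlo sampling. The function names `estimate_F` and `closed_form_F`, the trial count, and the seed are illustrative assumptions, not part of the paper. The comparison value $\min(n, p) / (|n - p| - 1)$ is the standard mean of an inverse-Wishart trace, valid only when $|n - p| > 1$; it is included here as a sanity check and is not claimed to be the form used in the theorem.

```python
import numpy as np


def estimate_F(n, p, trials=2000, seed=0):
    """Monte Carlo estimate of F(n, p) = E[ ||pinv(R)||_F^2 ] for R with i.i.d. N(0, 1) entries.

    `trials` and `seed` are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        R = rng.standard_normal((n, p))
        # ||pinv(R)||_F^2 equals the sum of 1 / sigma_i^2 over the nonzero singular values of R.
        total += np.linalg.norm(np.linalg.pinv(R), "fro") ** 2
    return total / trials


def closed_form_F(n, p):
    """Inverse-Wishart identity used as a sanity check (assumption: |n - p| > 1).

    For n > p + 1, E[tr((R^T R)^{-1})] = p / (n - p - 1); the case p > n + 1
    follows by transposing R.
    """
    gap = abs(n - p)
    if gap <= 1:
        raise ValueError("the expectation is not finite when |n - p| <= 1")
    return min(n, p) / (gap - 1)


if __name__ == "__main__":
    for n, p in [(50, 10), (10, 50)]:
        print(n, p, estimate_F(n, p), closed_form_F(n, p))
```

The check value grows without bound as $n$ approaches $p$, which matches the familiar behavior of least-squares estimators near the interpolation threshold.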

Figures (12)

Figure 1: Expected training (left) and test (right) set error in a toy model of NN generalization as a function of the number of samples $n$ and the number of model parameters $p$. The output dimensionality is set as $d=1$.
Figure 2: Empirical trends of training (blue) and test (orange) loss over four parametric variations for a NN trained on a sine wave regression task. The parameters varied are: $k$ (number of modules), $m$ (input dimensionality), $p$ (model size) and $n$ (training set size). In the first two plots, each line indicates a different model architecture. In the last two plots, each line indicates a different choice of $m$ between $5$ and $9$, with $n$ fixed at $1000$ and $p$ fixed at $1153$, respectively (left/right). The light lines are averaged over all other parameters, and bold lines show averages over the light lines. Dashed lines show theoretical predictions.
Figure 3: Theoretically predicted trend of $m$ vs. $n$ to achieve a test loss of $1.2$ on a sine wave regression task. Each line indicates a different fully connected NN with a different width and depth. $n$ increases approximately exponentially with $m$.
Figure 4: Comparison of our method with baselines of modular and monolithic architectures trained from random initialization on the sine wave regression task (a) and Compositional CIFAR-10 (b). (a): Required training sample size to achieve a desired test error vs. number of input dimensions. Each light line indicates a different model architecture, specified in the experiments appendix, averaged over five random seeds. The bold lines show averages over the light lines. (b): Accuracy vs. number of component images with a fixed number of training samples. Margins indicate standard errors over five random seeds.
Figure 5: Average cosine similarity between learned and target module directions over training for a modular NN initialized with our method vs. random initialization (baseline).
...and 7 more figures

Theorems & Definitions (4)

Theorem 1
Theorem 2
proof
proof
