Breaking Neural Network Scaling Laws with Modularity
Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
TL;DR
This work asks whether modular architectures can break the exponential sample-complexity barrier that arises because neural network generalization scales unfavorably with task dimensionality. A simple linear toy model yields closed-form expressions for training and test losses, showing that monolithic networks suffer dimension-dependent scaling while modular structures with bottleneck inputs can render generalization independent of the task dimension $m$. Building on this theory, the authors propose a kernel-based modular learning rule that aligns network modules with the true task modules and validate it empirically on sine-wave regression and Compositional CIFAR-10, achieving better in- and out-of-distribution generalization than baselines. The results provide a principled explanation for modularity's benefits and a practical method for recovering underlying task modules, with potential impact on high-dimensional, compositional problems; the authors acknowledge modeling and optimization limitations that warrant further research.
Abstract
Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However, a theoretical explanation of how modularity improves generalizability, and of how to leverage task modularity while training networks, remains elusive. Using recent theoretical progress in explaining neural network generalization, we investigate how the amount of training data required to generalize on a task varies with the intrinsic dimensionality of the task's input. We show theoretically that, on modularly structured tasks, nonmodular networks require a number of samples exponential in task dimensionality, whereas modular networks' sample complexity is independent of task dimensionality: modular networks can generalize in high dimensions. We then develop a novel learning rule for modular networks to exploit this advantage and empirically show that the rule improves generalization, both in- and out-of-distribution, on high-dimensional, modular tasks.
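To make "modularly structured task" concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a target that is a sum of K module functions, each seeing only a one-dimensional projection (a bottleneck) of the m-dimensional input. The sine nonlinearity, the helper name `make_modular_task`, and the sizes m = 16 and K = 3 are illustrative assumptions.

```python
import numpy as np

def make_modular_task(m, K, seed=0):
    """Hypothetical modular target: y(x) = sum_k sin(u_k . x).

    Each of the K modules depends on x only through the scalar
    projection u_k . x, so the task has K effective input
    directions regardless of the ambient dimension m.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, m))                 # one projection per module
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit-norm rows

    def target(X):
        X = np.atleast_2d(X)          # shape (n, m)
        Z = X @ U.T                   # bottleneck inputs, shape (n, K)
        return np.sin(Z).sum(axis=1)  # sum of module outputs, shape (n,)

    return U, target

m, K = 16, 3
U, target = make_modular_task(m, K)

rng = np.random.default_rng(1)
x = rng.standard_normal(m)
v = rng.standard_normal(m)

# Remove from v its component in the span of the module projections,
# leaving a direction that no module can see.
v_perp = v - np.linalg.pinv(U) @ (U @ v)

# Moving x along v_perp leaves every bottleneck input, and hence y, unchanged.
unchanged = np.allclose(target(x), target(x + v_perp))
```

In this sketch the target is invariant along the m - K directions orthogonal to the module projections. A modular learner that recovers those projections only has to fit K one-dimensional functions, whereas a monolithic learner treats the input as fully m-dimensional.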
