On The Specialization of Neural Modules

Devon Jarvis; Richard Klein; Benjamin Rosman; Andrew M. Saxe

TL;DR

This work formalizes systematic generalization by separating compositional structure from non-compositional structure, and introduces a tractable space of datasets with input/output blocks $X=[\Omega_x \ \Gamma_x]^T$, $Y=[\Omega_y \ \Gamma_y]^T$. It analyzes the training dynamics of deep and shallow linear networks via the SVD of the covariances $\Sigma^x$ and $\Sigma^{yx}$, showing that learning proceeds along three effective singular-value modes and that dense networks inherently couple the compositional and non-compositional sub-structures, which hinders systematicity. The authors show that modular architectures achieve a fully systematic mapping only when the lower-rank, compositional sub-structure is perfectly segregated from the rest of the network, and they validate these insights on Compositional MNIST (CMNIST), where a split, modular CNN preserves compositional generalization while a dense network fails. Together, the results clarify how dataset structure and architectural biases interact to enable or obstruct systematic module specialization, and they inform design principles for modular generalization in neural networks.
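The block structure and covariances named above can be made concrete with a small NumPy sketch. The construction is a hypothetical toy instance, not the paper's dataset: it assumes the compositional inputs $\Omega_x$ are all $\pm 1$ patterns over `n_x` features, the non-compositional blocks $\Gamma_x$ and $\Gamma_y$ are scaled identities (one feature per example), and $\Omega_y$ copies the first `n_y` compositional inputs. The helper name `make_toy_dataset` and the scale `r` as used here are illustrative assumptions.

```python
import itertools
import numpy as np

def make_toy_dataset(n_x=3, n_y=1, r=1.0):
    """Hypothetical toy instance of X=[Omega_x Gamma_x]^T, Y=[Omega_y Gamma_y]^T."""
    # Compositional inputs: every +/-1 pattern over n_x features; columns are examples.
    omega_x = np.array(list(itertools.product([-1.0, 1.0], repeat=n_x))).T
    n_examples = omega_x.shape[1]
    # Non-compositional inputs: one identity feature per example (an assumption).
    gamma_x = r * np.eye(n_examples)
    # Compositional outputs: copy the first n_y compositional inputs (an assumption).
    omega_y = omega_x[:n_y]
    gamma_y = r * np.eye(n_examples)
    X = np.vstack([omega_x, gamma_x])
    Y = np.vstack([omega_y, gamma_y])
    return X, Y

X, Y = make_toy_dataset()
N = X.shape[1]

# Input covariance and input-output covariance, averaged over examples.
sigma_x = X @ X.T / N
sigma_yx = Y @ X.T / N

# SVD of the input-output covariance; S holds the singular values in descending order.
U, S, Vt = np.linalg.svd(sigma_yx)
k = len(S)
reconstruction = (U[:, :k] * S) @ Vt[:k]

print("singular values:", np.round(S, 4))
```

In this toy instance the eight nonzero singular values of `sigma_yx` take only three distinct values (9/8, 3/8 and 1/8). Whether this matches the paper's three modes exactly depends on how its dataset space is parameterized, which this sketch does not reproduce.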

Abstract

A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the specialization of their modules is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.
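To illustrate what studying the learning dynamics of a dense linear network looks like in practice, the following minimal sketch trains a two-layer linear network with full-batch gradient descent and tracks the Frobenius norms of the four blocks of the input-output map (from $\Omega_x$ or $\Gamma_x$ to $\Omega_y$ or $\Gamma_y$). It is not the paper's experimental setup: the toy dataset (all $\pm 1$ patterns for $\Omega_x$, identity blocks for $\Gamma_x$ and $\Gamma_y$), the hidden width, the learning rate, the initialization scale and the helper `block_norms` are all assumptions chosen for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: compositional +/-1 features stacked on identity features.
n_x, n_y = 3, 1
omega_x = np.array(list(itertools.product([-1.0, 1.0], repeat=n_x))).T
N = omega_x.shape[1]
X = np.vstack([omega_x, np.eye(N)])
Y = np.vstack([omega_x[:n_y], np.eye(N)])

# Two-layer dense linear network with small random initial weights (assumed values).
hidden = 16
W1 = 1e-3 * rng.standard_normal((hidden, X.shape[0]))
W2 = 1e-3 * rng.standard_normal((Y.shape[0], hidden))

def block_norms(W, n_x, n_y):
    """Frobenius norm of each input/output block of the end-to-end map W."""
    return {
        "Omega_x->Omega_y": np.linalg.norm(W[:n_y, :n_x]),
        "Gamma_x->Omega_y": np.linalg.norm(W[:n_y, n_x:]),
        "Omega_x->Gamma_y": np.linalg.norm(W[n_y:, :n_x]),
        "Gamma_x->Gamma_y": np.linalg.norm(W[n_y:, n_x:]),
    }

lr = 0.05
losses, history = [], []
for step in range(5000):
    H = W1 @ X                      # hidden activations
    err = W2 @ H - Y                # prediction error on all examples
    losses.append(0.5 * np.sum(err ** 2) / N)
    if step % 50 == 0:
        history.append(block_norms(W2 @ W1, n_x, n_y))
    # Gradients of the mean squared error with respect to each layer.
    grad_W2 = err @ H.T / N
    grad_W1 = W2.T @ err @ X.T / N
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

final_norms = block_norms(W2 @ W1, n_x, n_y)
print("final loss:", losses[-1])
print("final block norms:", final_norms)
```

Plotting the entries of `history` against training step shows when each block of the mapping is acquired. A nonzero `Gamma_x->Omega_y` norm at the end of training would indicate that the dense network's compositional output depends on non-compositional inputs in this toy setting.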

Paper Structure (23 sections, 54 equations, 18 figures, 3 tables)

Introduction
Background
A Space of Datasets with Compositional Sub-structure
Systematicity as Exploiting Lower-rank Sub-structure
Learning Dynamics in Shallow and Deep Linear Networks
The evolution of systematicity over learning
Modularity and Network Architecture
Compositional MNIST (CMNIST)
Discussion
Motivating Example
Rank of Compositional Dataset Sub-structures
Learning Dynamics in Deep Linear Networks
Singular Value Decomposition Equations
Proving the Correctness of the SVD
Input and Output Partitioned Frobenius Norms
...and 8 more sections

Figures (18)

Figure 1: Problem setting and dataset space. (a) When navigating a maze towards a target, an agent might extract various input features, mapping these to a sequence of actions. (b) We schematize this setting with a space of datasets containing compositional ($\Omega$) and non-compositional ($\Gamma$) features in the input (middle panel) and output (right panel). Rows contain examples and columns contain features. In this case objects are identified with both a compositional component (based on features: size, shape and colour) and non-compositional component (based on absolute position).
Figure 2: Unlike shallow networks, deep networks show distinct stages of improvement over learning. (Panels a-d): Analytical learning dynamics for deep (a,b) and shallow (c,d) linear networks. (a,c) Comparisons of predicted (dotted) and actual (solid) singular value trajectories over learning, for one dataset's singular values. (b,d) Comparisons of predicted (dotted) and actual (solid) Frobenius norms of the input-output mapping to/from compositional ($\Omega_x,\Omega_y$) and non-compositional ($\Gamma_x,\Gamma_y$) features. Parameters: $n_x=3,k_x=3, n_y=1,k_y=1, r=1.$
Figure 3: Graphical representation of the deep network mapping. The dynamical modes $\pi_1$, $\pi_2$, and $\pi_3$ receive contributions from compositional and non-compositional input components and contribute to compositional and non-compositional output components. Systematic portions of the mapping, which rely only on compositional sub-structure, are shown in green. To learn a systematic mapping, no mode connected to the compositional output (input) component may also connect to the non-compositional input (output) component. (a) For the dense network, mode $\pi_1$ is the only mode which impacts the mapping to the compositional output (this corresponds to the $\Omega_x\Omega_y$ and $\Gamma_x \Omega_y$ norms in Figure \ref{fig:dense_norms} being learned at the same time as the $\pi_1$ mode in Figure \ref{fig:dense_svs}). However, $\pi_1$ also contributes to the non-compositional output mapping (the $\Omega_x \Gamma_y$ and $\Gamma_x \Gamma_y$ norms also rise with $\pi_1$). Thus, by learning the systematic mapping, some non-systematic mapping is also learned. (b and c) Impact of architectural biases. Architectures that partition compositional and non-compositional features in different ways, with the corresponding graphical representation of the resulting network mappings. The output-partitioned network removes the impact of non-compositional output features on mode $\pi_1$ but does not yield a systematic mapping. Only the fully partitioned network achieves systematicity. Compared to Figure \ref{fig:architecture_biases}(a), both graphical representations have fewer connections, particularly to/from the $\pi_1$ mode, reflecting the inductive bias imposed by modular architectures.
Figure 4:
Figure 5:
...and 13 more figures

Theorems & Definitions (1)

Definition 3.1
