Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks

Devon Jarvis, Richard Klein, Benjamin Rosman, Andrew M. Saxe

TL;DR

This paper develops a theory of feature learning in finite-width ReLU networks by mapping them to GDLNs through the Rectified Linear Network (ReLN), enabling full analytic training dynamics. It shows that ReLU networks exhibit an inductive bias toward structured mixed-selective latent representations that are reusable across contexts, a bias that strengthens with more contexts and deeper networks. The results reveal a unique, fastest ReLN mapping that mimics ReLU loss trajectories and uncover modular pathways that couple inputs and outputs via context-sensitive gating, with singular-value dynamics tracing the learning process. The work provides a principled explanation for the emergence of reusable, mixed-selective structure during slow feature learning and offers a framework for understanding how such representations scale with task complexity and depth.

Abstract

In spite of finite dimension ReLU neural networks being a consistent factor behind recent deep learning successes, a theory of feature learning in these models remains elusive. Currently, insightful theories still rely on assumptions including the linearity of the network computations, unstructured input data and architectural constraints such as infinite width or a single hidden layer. To begin to address this gap we establish an equivalence between ReLU networks and Gated Deep Linear Networks, and use their greater tractability to derive dynamics of learning. We then consider multiple variants of a core task reminiscent of multi-task learning or contextual control which requires both feature learning and nonlinearity. We make explicit that, for these tasks, the ReLU networks possess an inductive bias towards latent representations which are not strictly modular or disentangled but are still highly structured and reusable between contexts. This effect is amplified with the addition of more contexts and hidden layers. Thus, we take a step towards a theory of feature learning in finite ReLU networks and shed light on how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.

Paper Structure

This paper contains 19 sections, 2 theorems, 84 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 4.1

Given the definition of GDLNs and ReLU Networks, for any gating pattern implementable by the ReLU network there is $G(x_c)$ such that $G(x_c)_{j} = step(\overline{h}_{j}) \text{ or } G(x)_{j} = \mathbbm{1}(x)\ \forall\ j \in \{0,1,...,H\}$ where $\overline{h}$ is the pre-activations of the ReLU netw
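
The gating described in Lemma 4.1 rests on the identity $\mathrm{ReLU}(h) = step(h) \cdot h$. The following is a minimal sketch of that identity for a single hidden layer, not the paper's construction: the function names, layer sizes, and random weights are illustrative assumptions. Here the gates are recomputed from the same pre-activations, whereas a GDLN treats the gates as given functions of the input.

```python
import numpy as np

def relu_forward(x, W1, W2):
    """Single-hidden-layer ReLU network (no biases, illustrative only)."""
    h = W1 @ x                      # hidden pre-activations
    return W2 @ np.maximum(h, 0.0)  # ReLU, then linear readout

def gated_linear_forward(x, W1, W2):
    """Same weights, with the nonlinearity replaced by a multiplicative gate."""
    h = W1 @ x
    g = (h > 0).astype(h.dtype)     # gate = step(pre-activation), values in {0, 1}
    return W2 @ (g * h)             # linear network with gated hidden nodes

# Hypothetical sizes and weights, chosen only for the demonstration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 3))
W2 = rng.normal(size=(2, 8))
x = rng.normal(size=3)

y_relu = relu_forward(x, W1, W2)
y_gated = gated_linear_forward(x, W1, W2)
assert np.allclose(y_relu, y_gated)
```

For a fixed input the two forward passes agree exactly; the analytic leverage comes from holding the gating pattern fixed so that the remaining computation is linear in the input.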

Figures (9)

  • Figure 1: GDLN Formalism and notation. a) The GDLN applies gating variables to nodes ($g_v$) and edges ($g_q$) in an otherwise linear network. b) The gradient for an edge (using the red edge in (a) for example) can be written in terms of paths through that edge (colored lines). Each path is broken into the component preceding ($\bar{s}(p,e)$) and following ($\bar{t}(p,e)$) the edge.
  • Figure 2: Dynamics of ReLU networks during transition to nonlinear separability. (a) A simple dataset which has XoR structure in the first two dimensions, and is linearly separable with margin $2\Delta$ in the third. (b) GDLN exploiting linear structure. The network contains two pathways, one gated on only for positive examples and one gated on only for negative examples. Blue arrows in panel (a) depict ReLU weight directions that achieve this gating. (c) GDLN exploiting XoR structure. The network contains four pathways, each active on exactly one example. Orange arrows in panel (a) depict ReLU weight directions that achieve this gating. (d) Time to learn to a fixed criterion (loss=.2) calculated analytically for GDLNs with linear and XoR gating structure (blue and orange, respectively), and in simulation of ReLU networks (green). The ReLU network behaves like the faster of the two gated networks. Which gating structure is fastest changes at $\Delta=\sqrt{2/3}$ (grey dashed). (e) Analytical loss trajectories for the gated networks, and simulated ReLU networks for several values of $\Delta$. The full trajectories of the simulation match the faster of the gated networks. Parameters: learning rate $1/\tau=.4$, $N_h=128$ hidden units, initialization variance $4\cdot 10^{-8}/N_h$.
  • Figure 3: Dataset used to train the ReLU network (left) and ReLN architecture used to imitate it (right). Inputs (middle matrix) are created by appending a one hot vector encoding object identity to a one hot vector encoding context such that each item appears in all contexts. Target outputs (left and right matrices) contain some context-independent (top block) and some context specific properties (bottom three blocks). These datasets broadly follow a hierarchical structure across items (hierarchical tree depicted in the middle over input datapoint along the columns), but with some variation in each context-specific block. All structures are taken from saxe2019mathematical. The analysis in this work shows that the ReLU network dynamics arise from four implicit modules, made explicit by the ReLN pathways towards the right, which receive different subsets of inputs and generate different subsets of outputs. Together these graded mixed-selective pathways couple together to produce the correct output labels for each object. While each context-specific pathway is only on in two contexts (blocks of columns) they still produce labels for all three context-specific parts of the output space (blocks of rows). This creates errors which other pathways learn to remove. If this fine balance of activity between pathways is broken then errors will be incurred.
  • Figure 4: Summary of results for the ReLN imitating a ReLU network on a contextual nonlinear task. (a) Comparison of loss trajectories between the ReLU, empirical and predicted ReLN, and alternate GDLN which does not imitate the ReLU network loss trajectory. We find that a GDLN (here called "GDLN Single") which has contextual pathways active for individual contexts is unable to imitate the loss trajectory of the ReLU network. This is used in Proposition 4.2 to prove the uniqueness of the ReLN which we have identified. Example outputs from the ReLU network and ReLN are also shown and we see exact agreement between these output samples. (b) Singular value dynamics for the ReLN architecture using the neural race reduction dynamics. (c) Multi-dimensional Scaling which compares the relative latent representations of the ReLU network and ReLN over time. Both architectures demonstrate an equivalent latent representation at all points in time.
  • Figure 5: Effect of increasing the number of contexts on mixed selectivity: Summary of the loss trajectory (first row) for the ReLU network and corresponding ReLN as the number of contexts increases (columns). Due to the symmetry of the task we are also able to continue the derivation of the ReLN dynamics beyond the neural race reduction and obtain closed form trajectories for the networks' singular values. The singular value trajectories for the common pathway (second row) and mean trajectories for the contextual pathways (bottom row) match simulations exactly. Consequently we can derive the loss trajectory for each architecture in closed form and see perfect agreement with both the ReLN and ReLU networks. Altogether these results demonstrate that as the number of contexts increases, so too does the number of contexts a pathway is active for.
  • ...and 4 more figures
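
The input construction described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `build_inputs`, the choice of 8 items and 3 contexts, and the ordering (item one-hot first, context one-hot second) are all assumptions. Columns index datapoints, as in the caption.

```python
import numpy as np

def build_inputs(n_items, n_contexts):
    """Stack one column per (context, item) pair.

    Each column concatenates a one-hot item code with a one-hot context
    code, so every item appears once in every context.
    """
    item_codes = np.eye(n_items)
    context_codes = np.eye(n_contexts)
    columns = []
    for c in range(n_contexts):      # one block of columns per context
        for i in range(n_items):
            columns.append(np.concatenate([item_codes[i], context_codes[c]]))
    return np.stack(columns, axis=1)

# Hypothetical sizes for the demonstration.
n_items, n_contexts = 8, 3
X = build_inputs(n_items, n_contexts)
```

Under these assumptions `X` has one row per input unit and one column per datapoint, with exactly two active units in every column: one identifying the item and one identifying the context.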

Theorems & Definitions (3)

  • Definition 3.1
  • Lemma 4.1
  • Proposition 4.2