Task structure and nonlinearity jointly determine learned representational geometry

Matteo Alleman, Jack W Lindsey, Stefano Fusi

TL;DR

The paper investigates how the geometry of inputs and targets, together with the network's nonlinearity, shapes the representations learned by one-hidden-layer networks. Using controlled synthetic tasks, it shows that Tanh networks tend to align hidden representations with the target structure, while ReLU networks preserve more of the input geometry, a difference traced to the asymmetric gradient updates that the activation functions produce. The study extends to multi-layer and convolutional networks, showing that this activation-dependent geometry persists across depth and on realistic data, and identifies two properties of the activation function that influence the effect: symmetric saturating asymptotics and behavior near the origin. These findings point to a tradeoff between disentangled, transferable representations and representations that retain more input information, with implications for architecture design and transfer learning. Overall, the work provides a quantitative framework linking input/output geometry, nonlinearity, and learned representations, using kernel alignment, the parallelism score (PS), and cross-condition generalization performance (CCGP) to evaluate geometry and generalization across tasks and architectures.

Abstract

The utility of a learned neural representation depends on how well its geometry supports performance in downstream tasks. This geometry depends on the structure of the inputs, the structure of the target outputs, and the architecture of the network. By studying the learning dynamics of networks with one hidden layer, we discovered that the network's activation function has an unexpectedly strong impact on the representational geometry: Tanh networks tend to learn representations that reflect the structure of the target outputs, while ReLU networks retain more information about the structure of the raw inputs. This difference is consistently observed across a broad class of parameterized tasks in which we modulated the degree of alignment between the geometry of the task inputs and that of the task labels. We analyzed the learning dynamics in weight space and showed how the differences between the networks with Tanh and ReLU nonlinearities arise from the asymmetric asymptotic behavior of ReLU, which leads feature neurons to specialize for different regions of input space. By contrast, feature neurons in Tanh networks tend to inherit the task label structure. Consequently, when the target outputs are low dimensional, Tanh networks generate neural representations that are more disentangled than those obtained with a ReLU nonlinearity. Our findings shed light on the interplay between input-output geometry, nonlinearity, and learned representations in neural networks.
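The notion of alignment between input geometry and label geometry can be made concrete with a kernel alignment score. The sketch below uses the standard centered kernel alignment between linear kernels; whether the paper uses exactly this normalization is an assumption, and the function name `centered_kernel_alignment` and the four-point dataset are illustrative rather than taken from the paper.

```python
import numpy as np

def centered_kernel_alignment(A, B):
    """Centered alignment between the linear kernels of A and B.

    A, B: arrays of shape (n_conditions, n_features).
    Assumption: standard centered kernel alignment, not necessarily
    the exact definition used in the paper.
    """
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Ka = H @ (A @ A.T) @ H                   # centered input kernel
    Kb = H @ (B @ B.T) @ H                   # centered target kernel
    # Frobenius inner product, normalized by the Frobenius norms
    return np.sum(Ka * Kb) / (np.linalg.norm(Ka) * np.linalg.norm(Kb))

# Illustrative inputs: the four corners of a square (one row per condition)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

# Labels that follow the first input coordinate (linearly separable)
y_linear = X[:, :1]
# XOR-like labels: the product of the two coordinates
y_xor = (X[:, 0] * X[:, 1])[:, None]

align_linear = centered_kernel_alignment(X, y_linear)
align_xor = centered_kernel_alignment(X, y_xor)
print(round(align_linear, 3))  # 0.707: labels partly aligned with the inputs
print(round(align_xor, 3))     # 0.0: labels unaligned with the inputs
```

The same function can be applied to a hidden-layer activation matrix in place of `X` to ask whether a learned representation sits closer to the input kernel or to the target kernel.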

Paper Structure

This paper contains 22 sections, 3 equations, and 10 figures.

Figures (10)

  • Figure 1: A. Schematic of binary classification task with unstructured inputs. B. Measures of representational geometry during training. Error bars indicate standard deviation over 20 simulated networks. C. Schematic illustrating the inter-class axis and intra-class axis (left) and the procedure for computing the expected gradients of the task loss with respect to the input weights, projected along these axes (right two panels). The derivative $f'$ of the activation function is shown by the shading of space, and the vector $\vec{w}$ indicates the current value of the input-layer weight being considered. In this example, in the ReLU case, only the $x_2$ data point contributes to the gradient (red arrow). In the Tanh case, the gradient (dashed arrow) receives contributions from all four data points (colored arrows). D. Trajectories of input weights to hidden layer neurons along the inter-class and the intra-class axes. Each line segment represents an individual neuron from a simulation, and small circles indicate the initial conditions. Vector field indicates the gradient of the task objective.
  • Figure 2: A. Schematic of binary classification task with input structure parameterized by $\delta$, a factor indicating the degree of separability of the two classes (green and magenta clusters). B. Measures of representational geometry following network training as a function of $\delta$. Error bars indicate standard deviation over 20 simulated networks. C. Trajectories of input weights to hidden layer neurons as in Fig. 1D, for different values of $\delta$.
  • Figure 3: Values of various representational metrics for different values of $\delta$ (separability of trained dichotomy) and $\sigma^2$ (training noise) in the $\delta$-separable classification task of Section \ref{sec:deltaxor}.
  • Figure 4: A. Cartoon of the random sampling process, illustrated for $P=4$ inputs and $k=2$ outputs. B. Target alignment, PS, and CCGP as functions of input-output alignment in random classification tasks, for different values of $P$ (columns) and $k$ (rows). Cartoons schematize the target geometry for each value of $k$ (the number of target dimensions). In the plots, solid lines are the unique maximum-dimensional input geometry for specified alignment, and dots are 12 random samples of other lower-dimensional geometries. All tasks have a training noise variance of 1. C. Metrics in the final layer of a convolutional network trained on CIFAR10. Error bars are standard errors over random initializations.
  • Figure 5: Target and input kernel alignment for the representations at each hidden layer in multi-layer networks. Each network is trained until convergence. The inputs are generated as in Fig. \ref{fig:large}, with the addition of a constraint on the dimensionality of the inputs for the 'hard' task (see Section \ref{sec:multilayer} and Appendix F). All tasks have a training noise variance of 1.
  • ...and 5 more figures
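
The Figure 1C caption describes how the activation derivative $f'$ determines which data points contribute to the gradient on an input weight $\vec{w}$. A minimal sketch of that gating follows. The four data points, the weight vector, and the error signals are hypothetical values chosen for illustration, not the paper's, and `per_point_gradients` is an illustrative helper name.

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU: 1 where the pre-activation is positive, else 0
    return (z > 0).astype(float)

def tanh_grad(z):
    # Derivative of tanh: strictly positive for every finite pre-activation
    return 1.0 - np.tanh(z) ** 2

def per_point_gradients(X, w, err, fprime):
    """Contribution of each data point to the gradient on w.

    Each row is err_i * f'(w . x_i) * x_i, i.e. a back-propagated
    error signal gated by the activation derivative.
    """
    gate = fprime(X @ w)
    return (err * gate)[:, None] * X

# Hypothetical data: four points (one per row), NOT the paper's values
X = np.array([[ 1.0,  0.5],
              [-1.0,  0.5],
              [ 0.5, -1.0],
              [-0.5, -1.0]])
w = np.array([1.0, 1.0])                  # hypothetical input weight
err = np.array([1.0, -1.0, -1.0, 1.0])    # hypothetical error signals

relu_contrib = per_point_gradients(X, w, err, relu_grad)
tanh_contrib = per_point_gradients(X, w, err, tanh_grad)

# Count how many points contribute a nonzero gradient in each case
n_relu = int(np.count_nonzero(np.abs(relu_contrib).sum(axis=1)))
n_tanh = int(np.count_nonzero(np.abs(tanh_contrib).sum(axis=1)))
print(n_relu, n_tanh)  # 1 4
```

With these values only the first point has a positive pre-activation, so it alone passes the ReLU gate, while every point passes the Tanh gate with a weight that shrinks as the unit saturates.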