Half-Space Feature Learning in Neural Networks

Mahesh Lorik Yadav, Harish Guruprasad Ramaswamy, Chandrashekar Lakshminarayanan

TL;DR

This work reframes neural feature learning through a mixture-of-experts lens, introducing the Deep Linearly Gated Network (DLGN) as a midpoint between deep linear networks and ReLU networks. Each DLGN feature corresponds to an intersection of half-spaces in the input space, which allows a global, interpretable view of features via active path regions and the overlap kernel; the model still learns non-linear features, which are then linearly combined. Gradient descent is interpreted as a resource allocator that biases path usage toward simpler input regions, an interpretation supported by experiments on synthetic data and standard benchmarks. The paper offers both theoretical results (with proofs) and empirical evidence (CIFAR-10/100, Fashion MNIST, circle datasets) on how neural networks learn features, and hypothesizes that the same picture carries over to more common architectures such as ReLU networks.

Abstract

There are currently two extreme viewpoints on feature learning in neural networks: (i) neural networks simply implement a kernel method (à la NTK), and hence no features are learned; (ii) neural networks can represent (and hence learn) intricate hierarchical features suited to the data. We argue, based on a novel viewpoint, that neither interpretation is likely to be correct. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a path through a sequence of hidden units, with path length equal to the number of layers. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN can learn non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple: each feature is effectively an indicator function for a region compactly described as an intersection of as many half-spaces in the input space as there are layers. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations of neurons based on saliency/activation/gradient maps. We show that feature learning does happen in DLGNs, and that its mechanism is the learning of half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers (they all represent a half-space); however, the dynamics of gradient descent impart a distinct clustering to the later-layer neurons. We hypothesize that ReLU networks exhibit similar feature learning behaviour.
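
To make the half-space description concrete, here is a minimal NumPy sketch. It assumes a bias-free deep linear gating network with hard (0/1) gates; the function names, the hard gating, and the absence of biases are illustrative assumptions, not details stated in the abstract. Because the gating network is linear, every hidden unit's pre-activation is a linear function of the input, so each unit defines one half-space, and a path through one unit per layer defines the intersection of those half-spaces.

```python
import numpy as np


def effective_half_spaces(gating_weights):
    """Collapse a deep linear gating net into one normal vector per hidden unit.

    gating_weights: list of matrices W_1, ..., W_k (hypothetical, bias-free).
    Returns a list whose l-th entry is the product W_l ... W_1; row i of that
    matrix is the normal of the half-space attached to unit i of layer l.
    """
    normals, composed = [], None
    for W in gating_weights:
        composed = W if composed is None else W @ composed
        normals.append(composed)
    return normals


def path_feature(normals, path, x):
    """Indicator that x lies in the intersection of the path's half-spaces.

    path: one hidden-unit index per layer, e.g. (0, 2, 1).
    """
    return float(all(normals[l][i] @ x > 0 for l, i in enumerate(path)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 2)),
               rng.standard_normal((4, 4)),
               rng.standard_normal((4, 4))]
    normals = effective_half_spaces(weights)
    x = rng.standard_normal(2)
    print(path_feature(normals, (0, 1, 2), x))
```

Under these assumptions the feature depends only on the direction of the input: scaling `x` by a positive constant leaves every half-space test unchanged.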

Paper Structure

This paper contains 26 sections, 6 theorems, 27 equations, 16 figures, 2 tables.

Key Result

Theorem 1

Let $\widehat{y}({\mathbf x})$ be the output of the mixture of experts model in Equation eqn:MoE, with the gating model $f_\pi$ and expert model $g_\pi$ given by Equations eqn:ReLU-net-gating-net and eqn:RelU-net-individual-expert-model. Then where $h_0({\mathbf x})={\mathbf x}$ and $h_\ell({\mathbf x})={\boldsymbol \phi}(W_\ell h_{\ell-1}({\mathbf x}))$ for $\ell \in \{1,\ldots,L-1\}$.
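
Theorem 1 relates the mixture-of-experts output to the hidden representations $h_\ell$ of a ReLU network. The following sketch numerically checks the standard path decomposition of a bias-free ReLU network. The gating and expert definitions used here are assumptions, since the referenced equations are not reproduced in this listing: the gate of a path is taken to be the product of the 0/1 activations of its hidden units, and the expert is taken to be the product of the weights along the path applied to the input. The function names and the scalar output layer are likewise illustrative.

```python
import itertools

import numpy as np


def relu_forward(weights, x):
    """Bias-free ReLU net: hidden matrices followed by a 1-D output vector.

    Returns the scalar output and the 0/1 gate pattern of each hidden layer.
    """
    h, gates = x, []
    for W in weights[:-1]:
        pre = W @ h
        gates.append((pre > 0).astype(float))
        h = np.maximum(pre, 0.0)
    return float(weights[-1] @ h), gates


def mixture_of_paths(weights, x):
    """Sum over all paths of (path gate) * (path expert output)."""
    _, gates = relu_forward(weights, x)
    widths = [W.shape[0] for W in weights[:-1]]
    total = 0.0
    for path in itertools.product(*[range(m) for m in widths]):
        # Gate: 1 only if every hidden unit on the path is active for x.
        gate = np.prod([gates[l][i] for l, i in enumerate(path)])
        # Expert: linear in x, with the weights multiplied along the path.
        value = weights[0][path[0], :] @ x
        for l in range(1, len(path)):
            value *= weights[l][path[l], path[l - 1]]
        value *= weights[-1][path[-1]]
        total += gate * value
    return float(total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((3, 2)),
               rng.standard_normal((3, 3)),
               rng.standard_normal(3)]
    x = rng.standard_normal(2)
    y, _ = relu_forward(weights, x)
    print(abs(y - mixture_of_paths(weights, x)) < 1e-9)
```

The enumeration is exponential in depth, so it is only practical for tiny networks; it is meant as a sanity check of the identity, not as a way to evaluate a model.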

Figures (16)

  • Figure 1: Example ReLU nets with weights and biases such that: (Left) the active path region through the red and green hidden nodes is $[0.5,1.5] \cup [2.5, 3.5]$; (Right) the active path region through the green and brown hidden nodes abruptly changes from the right half of the $2$-dimensional input space to only the first quadrant when the parameter $\epsilon$ changes sign.
  • Figure 2: (Top left) Target label $y$ as a function of the angles that input ${\mathbf x}$ makes. (Bottom left) A scatter plot of the data with points colored according to the target $y$. (Right) The training loss of ReLU and DLGN models. DLGN models with fixed gating take more than 2000 iterations to converge to loss (0.01) and are cropped here.
  • Figure 3: The (trace-normalised) overlap kernel for the ReLU net (left three images) and DLGN (right three images) at initialization (left), epoch 3 (middle) and epoch 200 (right). The data points are ordered based on angle -- first half corresponds to data in top half of the circle.
  • Figure 4: The (trace-normalised) empirical Neural Tangent Kernel for the ReLU net (left three images) and DLGN (right three images) visualised as a matrix at initialization (left), epoch 3 (middle) and epoch 200 (right)
  • Figure 5: The model outputs of the ReLU net (left), DLGN (middle) and DLGN with the gating model $f_\pi$ frozen (right) at 3 different epochs.
  • ...and 11 more figures
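
The trace-normalised overlap kernel shown in Figure 3 can be sketched as follows. The sketch assumes that the overlap kernel counts the paths that are active for both inputs; that definition is not given in this listing and should be read as an assumption, as should the function names and the bias-free ReLU network. For a layered network with hard gates, the number of commonly active paths factorises into a product over layers of the number of commonly active hidden units.

```python
import numpy as np


def gate_patterns(hidden_weights, X):
    """0/1 activation pattern of every hidden layer, for each row of X.

    hidden_weights: list of hidden-layer matrices of a bias-free ReLU net.
    """
    H, patterns = X, []
    for W in hidden_weights:
        pre = H @ W.T
        patterns.append((pre > 0).astype(float))
        H = np.maximum(pre, 0.0)
    return patterns


def overlap_kernel(patterns):
    """Trace-normalised count of paths active for both inputs.

    Assumes at least one path is active for some input, so the trace is
    non-zero.
    """
    n = patterns[0].shape[0]
    K = np.ones((n, n))
    for G in patterns:
        # (G @ G.T)[a, b] = number of units in this layer active for both.
        K *= G @ G.T
    return K / np.trace(K)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = [rng.standard_normal((8, 2)), rng.standard_normal((8, 8))]
    X = rng.standard_normal((6, 2))
    K = overlap_kernel(gate_patterns(hidden, X))
    print(K.shape)
```

Ordering the rows of `X` by angle before computing the kernel, as in the figure, would make any block structure visible when the matrix is plotted.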

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Theorem
  • Lemma 3
  • proof
  • proof : Proof of Theorem 2
  • Theorem
  • Lemma 4
  • proof
  • proof : Proof of Theorem 1