Understanding How Nonlinear Layers Create Linearly Separable Features for Low-Dimensional Data

Alec S. Xu, Can Yaras, Peng Wang, Qing Qu

TL;DR

This work analyzes how a shallow nonlinear network with random Gaussian weights and a quadratic activation can transform data drawn from a union of low-dimensional subspaces into linearly separable sets. The authors prove that, for two subspaces, the transformed features are linearly separable with high probability once the hidden-layer width $D$ grows polynomially with the intrinsic dimension $r$; the required width depends on $r$ and on the principal angles between the subspaces rather than on the ambient dimension. They extend the result to $K > 2$ subspaces via one-vs-all separation, and provide experimental evidence on synthetic data and CIFAR-10 (via MCR$^2$ representations) showing the practical relevance of the theory and its robustness to other activations. The results offer a theoretical bridge between the linear separability observed in early network layers and the role of overparameterization and random features in generalization, with implications for interpretability and the design of representation-learning systems.

Abstract

Deep neural networks have attained remarkable success across diverse classification tasks. Recent empirical studies have shown that deep networks learn features that are linearly separable across classes. However, these findings often lack rigorous justifications, even under relatively simple settings. In this work, we address this gap by examining the linear separation capabilities of shallow nonlinear networks. Specifically, inspired by the low intrinsic dimensionality of image data, we model inputs as a union of low-dimensional subspaces (UoS) and demonstrate that a single nonlinear layer can transform such data into linearly separable sets. Theoretically, we show that this transformation occurs with high probability when using random weights and quadratic activations. Notably, we prove this can be achieved when the network width scales polynomially with the intrinsic dimension of the data rather than the ambient dimension. Experimental results corroborate these theoretical findings and demonstrate that similar linear separation properties hold in practical scenarios beyond our analytical scope. This work bridges the gap between empirical observations and theoretical understanding of the separation capacity of nonlinear networks, offering deeper insights into model interpretability and generalization.
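
The setting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the dimensions `d`, `r`, `width`, and `n`, the helper names, and the least-squares linear probe are all assumptions chosen for illustration. The sketch samples unit-norm points from two random $r$-dimensional subspaces, applies one layer with random Gaussian weights and a quadratic activation, and checks whether a bias-free linear classifier on the features separates the two classes.

```python
import numpy as np

def random_subspace_basis(d, r, rng):
    """Orthonormal basis (d x r) of a random r-dimensional subspace of R^d."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def sample_unit_points(basis, n, rng):
    """Draw n unit-norm points from the subspace spanned by `basis`."""
    coeffs = rng.standard_normal((n, basis.shape[1]))
    points = coeffs @ basis.T
    return points / np.linalg.norm(points, axis=1, keepdims=True)

def quadratic_random_features(x, weights):
    """One nonlinear layer: elementwise square of the linear map W x."""
    return (x @ weights.T) ** 2

rng = np.random.default_rng(0)
# Illustrative sizes (assumptions): ambient dim, intrinsic dim, width, points per class.
d, r, width, n = 100, 2, 20, 200

u1 = random_subspace_basis(d, r, rng)
u2 = random_subspace_basis(d, r, rng)
x = np.vstack([sample_unit_points(u1, n, rng), sample_unit_points(u2, n, rng)])
y = np.concatenate([np.ones(n), -np.ones(n)])  # +1 for subspace 1, -1 for subspace 2

w = rng.standard_normal((width, d))            # random Gaussian weights, never trained
feats = quadratic_random_features(x, w)

# Bias-free linear probe fitted by least squares (an assumed, simple choice of probe).
v, *_ = np.linalg.lstsq(feats, y, rcond=1e-10)
scores = feats @ v
separable = bool(np.all(np.sign(scores) == y))
print("width:", width, "ambient dim:", d, "linearly separated:", separable)
```

In this toy configuration the width is smaller than the ambient dimension, in the spirit of the claim that the required width is governed by the intrinsic dimension. The raw inputs cannot be separated by a bias-free linear classifier, since each subspace contains both $x$ and $-x$.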

Paper Structure (54 sections, 62 equations, 8 figures)

Figures (8)

  • Figure 1: Linear separability and compression of features across layers. The initial layers transform the input to be linearly separable, while the deeper layers compress the features. Following the setup in wang2023understanding, we trained two networks on CIFAR-10: a 6-layer multi-layer perceptron (MLP, left) and a 6-layer hybrid network (a 3-layer MLP followed by a 3-layer linear network, right), both with hidden dimensions of $1024$. For each trained network, we conducted linear probing on the features from each layer. At each layer, we recorded the linear probe accuracy and the numerical rank of the feature matrix, defined as the minimum number of singular values accounting for at least $95\%$ of the nuclear norm, and plotted these results.
  • Figure 2: Phase transition of linear separability w.r.t. dimensions $(d,r)$ and network width $D$. We demonstrate that the network width required to achieve linear separability of a union of two subspaces scales polynomially with the intrinsic dimension. See \ref{ssec:phase-transition} for details.
  • Figure 3: The principal angle between a 1-dim subspace $\mathcal{S}_1$ and a 2-dim subspace $\mathcal{S}_2$.
  • Figure 4: An illustration of \ref{prob:binary}. We aim to find conditions on the network $f$ so that a union of subspaces (left) transforms into linearly separable sets (right).
  • Figure 5: Activation alone is insufficient for linearly separating two subspaces. When $\mathcal{S}_1 = \mathrm{span}(\bm{u}_1)$ and $\mathcal{S}_2 = \mathrm{span}(\bm{u}_2)$, the sets $\sigma(\mathcal{S}_1)$ and $\sigma(\mathcal{S}_2)$ are not linearly separable for $\sigma(\cdot) =$ quadratic (left) and $\sigma(\cdot) = \mathrm{ReLU}(\cdot)$ (right).
  • ...and 3 more figures
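
Two quantities named in the captions above can be computed in a few lines. The following is a hedged NumPy sketch, not the authors' implementation: `numerical_rank` follows the definition in the Figure 1 caption (the minimum number of singular values accounting for at least $95\%$ of the nuclear norm), and `principal_angles` uses the standard construction in which the cosines of the principal angles are the singular values of $Q_a^\top Q_b$ for orthonormal bases $Q_a, Q_b$. The function names and the toy inputs are assumptions.

```python
import numpy as np

def numerical_rank(features, energy=0.95):
    """Smallest k such that the top-k singular values sum to >= `energy` of the nuclear norm."""
    singular_values = np.linalg.svd(features, compute_uv=False)  # sorted in descending order
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, energy) + 1)

def principal_angles(basis_a, basis_b):
    """Principal angles (radians, ascending) between the column spans of two matrices."""
    qa, _ = np.linalg.qr(basis_a)  # orthonormalize each basis
    qb, _ = np.linalg.qr(basis_b)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Toy feature matrix with singular values 10, 5, 1, 0.1 (hypothetical data).
toy_features = np.diag([10.0, 5.0, 1.0, 0.1])
toy_rank = numerical_rank(toy_features)

# A line tilted by 30 degrees out of the xy-plane, mirroring the 1-dim vs 2-dim case in Figure 3.
theta = np.pi / 6
line = np.array([[np.cos(theta)], [0.0], [np.sin(theta)]])
plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
angles = principal_angles(line, plane)
print("numerical rank:", toy_rank, "principal angle (rad):", angles)
```

For the toy matrix, the top two singular values cover about $93\%$ of the nuclear norm and the top three about $99\%$, so the numerical rank is 3. The tilted line makes a single principal angle of $\pi/6$ with the plane.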
