An Adaptive Tangent Feature Perspective of Neural Networks

Daniel LeJeune, Sina Alemohammad

TL;DR

This work considers linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint, and shows that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions.

Abstract

In order to better understand feature learning in neural networks, we propose a framework for understanding linear models in tangent feature space where the features are allowed to be transformed during training. We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint. We show that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions. Specializing to neural network structure, we gain insights into how the features and thus the kernel function change, providing additional nuance to the phenomenon of kernel alignment when the target function is poorly represented using tangent features. We verify our theoretical observations in the kernel alignment of real neural networks.

Paper Structure (27 sections, 5 theorems, 30 equations, 3 figures)

Key Result

Theorem 1

There is a solution to eq:adaptive-kernel-learning with $\Omega = \Omega_\omega$ satisfying where $s = \mathop{\mathrm{arg\,min}}\limits_{z \geq 1} \omega(z) + \frac{\|\widehat{{\bm{\beta}}}\|_{2}^2}{z^2}$ and $\widehat{{\bm{\beta}}}$ is given by eq:equivalent-optimization with $\widetilde{\Omega} = \|\cdot\|_{2}$. Furthermore, the adapted kernel for this solution is given by
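
The scalar $s$ in Theorem 1 is defined by a one-dimensional minimization, so it can be approximated numerically. The following is a minimal sketch, not the authors' code. It assumes, for illustration only, the penalty $\omega(z) = (z - 1)^2$ (the $\omega_1$ listed in the Figure 2 caption) and that the objective is unimodal on the search interval. The function name `adaptation_scale` and the search bound `upper` are hypothetical.

```python
import math


def adaptation_scale(beta_norm, omega, upper, tol=1e-9):
    """Approximate s = argmin over z in [1, upper] of omega(z) + beta_norm**2 / z**2.

    Uses golden-section search, which assumes the objective is unimodal
    on [1, upper]. `upper` is a user-chosen bound, not part of the theorem.
    """
    def objective(z):
        return omega(z) + beta_norm ** 2 / z ** 2

    inv_phi = (math.sqrt(5) - 1) / 2  # reciprocal of the golden ratio
    lo, hi = 1.0, float(upper)
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if objective(a) < objective(b):
            hi = b  # minimizer lies in [lo, b]
        else:
            lo = a  # minimizer lies in [a, hi]
    return (lo + hi) / 2


def omega_1(z):
    """Illustrative penalty (assumed): omega_1(z) = (z - 1)^2."""
    return (z - 1) ** 2


beta_norm = 2.0
# For this omega, stationarity gives z^3 (z - 1) = beta_norm^2, so z <= 1 + beta_norm^2.
s = adaptation_scale(beta_norm, omega_1, upper=1 + beta_norm ** 2)
```

For this choice of $\omega$ the objective is convex on $z \geq 1$, so the search is well behaved. When $\|\widehat{\bm{\beta}}\|_2 = 0$ the minimizer sits on the boundary $s = 1$, and $s$ grows as $\|\widehat{\bm{\beta}}\|_2$ grows.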

Figures (3)

  • Figure 1: A more difficult task yields higher label kernel alignment. We perform regression using a multi-layer perceptron on 500 MNIST digits from classes 2 and 3. We construct target labels $y_i$ as the best linear fit of binary $\pm 1$ labels using random neural network features. Then we train two networks, one (left) trained to predict $y_i$, and one (right) trained to predict $\mathrm{sign}(y_i)$. We present the adapted kernel and label kernel matrices for data points ordered according to $y_i$ and report the cosine similarity of the adapted kernel and the label kernel. The harder task of regression with binarized labels has a higher label kernel alignment. Further details are given in \ref{sec:mnist_reg_details}.
  • Figure 2: Effective penalties are sub-quadratic. We plot effective penalties for $\omega_1(v) = (v - 1)^2$, $\omega_2(v) = |v - 1|$, and $\omega_3 = \omega_1 \oplus \omega_2$. All are sub-quadratic, yet all behave like $v^2$ near $v = 0$.
  • Figure 3: Adapted kernel reveals difficult structure. For the neural network from \ref{fig:mnist_reg_kernel} trained on binarized labels $\mathrm{sign}(y_i)$ (left), the target function (green, solid) is difficult while the function $\mathbf{x} \mapsto y$ (black, dotted) is easily predicted using tangent features. The network must learn (right) to fit the residual (red, dashed), which results in the kernel (orange, $\times$) being highly influenced by difficult training points (near $y = 0$).
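
The Figure 1 caption reports the cosine similarity between the adapted kernel and the label kernel. The following is a minimal sketch of that quantity, assuming the label kernel is the outer product $\mathbf{y}\mathbf{y}^\top$ and the similarity is taken under the Frobenius inner product (the usual convention for kernel alignment; the page itself does not spell this out). The function name `kernel_alignment` and the toy matrices are hypothetical.

```python
import math


def kernel_alignment(K, y):
    """Cosine similarity between a kernel matrix K and the label kernel y y^T.

    K is a square list of lists, y a list of labels of the same length.
    Assumes the label kernel is the outer product y y^T.
    """
    n = len(y)
    # Frobenius inner product <K, y y^T> = sum_ij K_ij * y_i * y_j
    inner = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    label_norm = sum(v * v for v in y)  # ||y y^T||_F equals ||y||_2^2
    return inner / (k_norm * label_norm)


y = [-1.0, -1.0, 1.0, 1.0]
n = len(y)

# A kernel equal to the label kernel is perfectly aligned.
label_kernel = [[y[i] * y[j] for j in range(n)] for i in range(n)]
aligned = kernel_alignment(label_kernel, y)

# An identity kernel carries no label structure and aligns only weakly.
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
weak = kernel_alignment(identity, y)
```

In this toy case `aligned` is 1 and `weak` is $1/\sqrt{n}$, which illustrates why a kernel whose block structure follows the labels scores higher.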

Theorems & Definitions (11)

  • Theorem 1
  • proof : Proof sketch
  • Theorem 2
  • Proposition 3
  • Proposition 4
  • proof
  • proof
  • Lemma 5
  • proof
  • proof
  • ...and 1 more