Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Berfin Şimşek; Amire Bendjeddou; Wulfram Gerstner; Johanni Brea

TL;DR

The paper analyzes distilling an under-parameterized student network from a large two-layer teacher under Gaussian inputs with orthogonal teacher weights. By re-expressing the loss through neuron interactions and recasting it as a constrained optimization over order parameters, it proves that copy-average (CA) configurations are critical points for unit-orthonormal teachers. The optimal structure has n−1 student neurons each copy one teacher neuron while the remaining neuron averages the rest. The paper also provides closed-form solutions for the one-neuron case for erf and ReLU, and shows that the copy-average arrangement yields the best approximation error L_erf^*(k−n+1) in the erf setting. Empirically, gradient flow converges to CA points or to near full-copy configurations, and a phase diagram reveals regimes of CA-dominant behavior with compression. These results suggest a universal organizational pattern in under-parameterized networks and practical warm-start strategies for distillation. The work uses an interaction-function framework to connect weight-space optimization with tractable order-parameter analysis, enabling exact results in otherwise intractable nonconvex settings and informing both the theory and practice of model compression.

Abstract

Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions by solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.
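To make the setting concrete, the following is a minimal sketch (not the authors' code) of the population loss for erf networks under standard Gaussian inputs. It relies on the standard arcsine identity $\mathbb{E}_x[\mathrm{erf}(w \cdot x)\,\mathrm{erf}(v \cdot x)] = \frac{2}{\pi} \arcsin\!\big(2\, w \cdot v / \sqrt{(1+2\|w\|^2)(1+2\|v\|^2)}\big)$. The function names, the loss convention without a factor $1/2$, the choice $k=4$, $n=2$, and the coarse grid over the averaging neuron's norm and outgoing weight are all assumptions made for illustration.

```python
import numpy as np

def erf_gram(A, B):
    """Matrix of E_x[erf(a_i . x) erf(b_j . x)] for x ~ N(0, I) (arcsine identity)."""
    num = 2.0 * A @ B.T
    na = 1.0 + 2.0 * np.sum(A * A, axis=1)
    nb = 1.0 + 2.0 * np.sum(B * B, axis=1)
    ratio = num / np.sqrt(np.outer(na, nb))
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))

def population_loss(W, a, V, b):
    """E_x[(sum_i a_i erf(w_i . x) - sum_j b_j erf(v_j . x))^2].

    Assumed convention: no 1/2 prefactor.
    """
    return (a @ erf_gram(W, W) @ a
            - 2.0 * a @ erf_gram(W, V) @ b
            + b @ erf_gram(V, V) @ b)

# Unit-orthonormal teacher: orthonormal incoming vectors, outgoing weights 1.
k, n = 4, 2          # illustrative sizes (assumption)
d = k + 1            # input dimension as in the Figure 1 caption
V = np.eye(k, d)     # rows are orthonormal teacher vectors v_1..v_k
b = np.ones(k)

# Candidate 1: each student neuron exactly copies a different teacher neuron.
loss_copy = population_loss(V[:n], np.ones(n), V, b)

# Candidate 2: n-1 neurons copy; the n-th points along the average of the
# remaining k-n+1 teacher vectors. Its norm r and outgoing weight c are
# scanned on a coarse, arbitrary grid rather than solved for exactly.
avg_dir = V[n - 1:].sum(axis=0) / np.sqrt(k - n + 1)

def copy_average_loss(r, c):
    W = np.vstack([V[:n - 1], r * avg_dir])
    a = np.concatenate([np.ones(n - 1), [c]])
    return population_loss(W, a, V, b)

grid_r = np.linspace(0.05, 2.0, 40)
grid_c = np.linspace(0.5, 6.0, 56)
loss_ca = min(copy_average_loss(r, c) for r in grid_r for c in grid_c)

print(f"exact-copy loss:          {loss_copy:.4f}")
print(f"copy-average loss (grid): {loss_ca:.4f}")
```

Because the teacher vectors are orthonormal, the exact-copy candidate leaves $k-n$ teacher neurons unexplained, each contributing $\frac{2}{\pi}\arcsin(2/3)$ to the loss. The grid value for the copy-average candidate is only an upper bound on what an exact optimization over $r$ and $c$ would give.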

Paper Structure (34 sections, 19 theorems, 215 equations, 10 figures, 1 table)

Introduction
Related Work
Setup
Foundations & Constrained Optimization Formulation
Copy-Average Critical Points
Approximation Error of Underparameterized Networks
One-Neuron Network
Multi-Neuron Network
Conclusion & Future Directions
Summary of Results
Further Comparison to Literature
Further Experiments
One-Neuron Network
Erf Experiments
ReLU Experiments
...and 19 more sections

Key Result

Proposition 4.1

Assume that $f^*$ is an orthogonal teacher network (Eq. \ref{eq:orthogonal-teacher}) of width $k$. If the activation function satisfies Assumption \ref{ass:interaction} (i), any non-trivial critical point $\theta^* = (w^*, a^*)$, i.e. $\nabla L^{1, k}(\theta^*) = 0$, $\| w^* \| \neq 0$, $a^*\neq 0$, satisfies t

Figures (10)

Figure 1: The gradient flow converges to the copy-average optimum point for erf activation (top), or nearby for ReLU activation (bottom): the first $n\!-\!1$ neurons copy one teacher neuron each; the $n$-th neuron takes an average of the remaining teacher neurons. The teacher network is unit-orthonormal, i.e. $f^*(x) \!=\! \sum_{j=1}^k \! \sigma(v_j \! \cdot \! x)$ where $v_j \! \in \! \mathbb{R}^d$'s are orthonormal, and $d\!=\!k\!+\!1$. A1 The gradient flow trajectory is shown in the weight space for $n\!=\!2, k\!=\!3$: the positions of the circles (red and green) represent incoming vector $w_i$ projected down to the span of $v_1, v_2, v_3$ and the sizes of the circles represent outgoing weights $a_i$. The blue circle represents the one-neuron solution (the position shows $w^*$, the size shows $a^*$). A2 Same setting, the weight-space parameters at convergence are mapped to the order-parameter space; $u_i=(u_{i1}, ..., u_{ik})$ where $u_{ij}$ represents the normalized dot product between $w_i$ and $v_j$ and $r_i=\|w_i \|$. B Order parameters shown at convergence for $n\!=\!4, k\!=\!8$. For erf (top) the point at convergence is exactly an $(n-1)$-copy-$1$-average point, whereas for ReLU, it is perturbed away from this configuration. Neurons are reordered for clarity.
Figure 2: Cartoon representation of the mapping of a student with three neurons from the weight space $\mathbb R^{nd}$ (A) to the order parameter space (B1-B2). The mapping between the outgoing weights is an identity mapping, hence not shown. A Each axis shows the direction of weights $v_i$ of one teacher neuron ($k \geq 3$). B1 Each incoming vector $w_i \in \mathbb R^d$ is first transformed into $(r_i, w_i/r_i)$ and then $w_i/r_i$ is projected onto the span of the teacher's incoming vectors, yielding the student-teacher correlation vector $u_i=(u_{i1}, ..., u_{ik})$. B2 The student-student correlations $\rho_{i i'}$ are in general free parameters bounded between $u_i \cdot u_{i'} \pm \sqrt{1-\| u_i \|^2} \sqrt{1-\| u_{i'} \|^2}$, hence the box constraint. An activated constraint, w.l.o.g. $u_1 \in \mathbb S^{k-1}$, gives a vanishing $\pm$ term for the interval of correlation $\rho_{1 i}$ for all $i \neq 1$, hence they are no longer free (shown in red). In the case $d=k$, all $u_i$ are on the hypersphere due to the problem geometry, hence the correlations $\rho_{i i'}$ are fixed and not free (see Appendix \ref{sec:equality-constraints}).
Figure 3: One-neuron network solutions. A Network output (color coded) as a function of input in $d=2$ for (left) a unit-orthonormal network with $k=2$ neurons (incoming vectors $v_1$ and $v_2$ are shown as black dots) and (right) the student network function generated by the optimal solution (incoming vector shown in red) for the erf activation function. B Same for the softplus activation function. C Approximation error of a student with $n=1$ neuron as a function of the number $k$ of teacher neurons. For large $k$, the approximation error for $n=1$ grows near-linearly for the differentiable activation functions studied in this paper (erf, sigmoid, tanh, and softplus with $\beta=1$); however, the growth is quadratic for ReLU (see Appendix Corollary \ref{corr:relu}).
Figure 4: Under-parameterized student networks of width $n$ with erf activation function learning (via gradient flow) from a unit-orthonormal teacher network of width $k$. A Each dot is the mean error at convergence for $20$ seeds of random initializations; black-dashed lines are the theory predictions $L_{\text{erf}}^*(k\!-\!n\!+\!1)$, see Eq. \ref{eq:conjecture}. Standard deviations do not show on the figure as they are too small. We identify four regimes indicated by colors (green, gray, blue, red) depending on the type of solution found by gradient flow (GF). In the green regime, GF converges to an optimal $(n-1)$-C-$1$-A solution for all $20$ initializations (Fig. \ref{fig:erf-up-nets}-B1). In the gray regime, GF converges either to an $(n-1)$-C-$1$-A solution or to a "Perturbation of the all-copy solution" that we call P-$n$-C (Fig. \ref{fig:erf-up-nets}-B2). In the blue and red regimes, for $n > \gamma_2 k$ where $n=8,12,16$, the gradient flow converges to a P-$n$-C solution from all seeds (Fig. \ref{fig:erf-up-nets}-B3). Moreover, in the red regime, for $n > \gamma_3 k$ where $n=8,12,16$ and $\gamma_3$ is near $0.75$, the P-$n$-C solutions achieve lower loss than the $(n-1)$-C-$1$-A solutions (Fig. \ref{fig:erf-up-nets}-B4). B1-B4 Examples of loss at convergence (vertical axis) for all $20$ different initialization seeds (horizontal axis); theory is shown by the red-dashed horizontal line. Insets show examples of correlation matrices $u_{ij}$ ($k$ lines, $n$ columns) between student and teacher incoming vectors at convergence after reordering neurons. In the gray regime (e.g. B2) the gradient flow converges to either one of the two types of minima with correlations shown in the inset; in the other regimes, it consistently converges to the same minimum up to permutations.
Figure 5: Structure of the optimal solution of the one-neuron network for various activation functions. We trained $20$ seeds of one-neuron students learning from the unit-orthonormal teacher networks with $k=2, ..., 10$ neurons. All students converge to the same optimal solution up to symmetries (that is, positive-scaling symmetry for ReLU and sign symmetry for odd activation functions such as tanh and erf). A For ReLU, the magnitude $\| w^*\| a^*$ exactly matches the result of Corollary \ref{corr:relu}. For softplus, the magnitude is very close to $\sqrt{k}$; for sigmoid, tanh, and erf, it is below $\sqrt{k}$. B The norm of the incoming vector is smaller than $1/\sqrt{k}$ for softplus, sigmoid, tanh, and erf. C The outgoing weight is larger than $k$ for softplus and tanh, and it is virtually $k$ for sigmoid and erf.
...and 5 more figures
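The weight-space to order-parameter mapping described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function names, the random student weights, and the sizes $n=3$, $k=3$, $d=5$ are assumptions. The code computes $r_i = \|w_i\|$, the student-teacher correlations $u_{ij}$, the student-student correlations $\rho_{ii'}$, and the box-constraint interval $u_i \cdot u_{i'} \pm \sqrt{1-\|u_i\|^2}\sqrt{1-\|u_{i'}\|^2}$.

```python
import numpy as np

def order_parameters(W, V):
    """Map student incoming vectors W (n x d) to order parameters.

    V (k x d) holds orthonormal teacher incoming vectors as rows.
    Returns norms r_i, correlations u_ij, and student-student correlations rho.
    """
    r = np.linalg.norm(W, axis=1)
    W_hat = W / r[:, None]        # unit-norm directions w_i / r_i
    U = W_hat @ V.T               # u_ij = (w_i / r_i) . v_j
    rho = W_hat @ W_hat.T         # rho_ii' = (w_i . w_i') / (r_i r_i')
    return r, U, rho

def rho_bounds(U):
    """Box constraint on rho_ii' implied by the student-teacher correlations."""
    center = U @ U.T
    slack = np.sqrt(np.clip(1.0 - np.sum(U * U, axis=1), 0.0, None))
    half_width = np.outer(slack, slack)
    return center - half_width, center + half_width

rng = np.random.default_rng(0)    # arbitrary seed for the illustration
n, k, d = 3, 3, 5                 # d > k leaves room outside the teacher span
V = np.eye(k, d)
W = rng.normal(size=(n, d))

r, U, rho = order_parameters(W, V)
lower, upper = rho_bounds(U)
print("rho within box constraint:",
      bool(np.all(rho >= lower - 1e-12) and np.all(rho <= upper + 1e-12)))

# With d = k every u_i lies on the unit hypersphere, so the interval collapses
# and rho is fully determined by U.
V_sq = np.eye(k)
W_sq = rng.normal(size=(n, k))
_, U_sq, rho_sq = order_parameters(W_sq, V_sq)
lower_sq, upper_sq = rho_bounds(U_sq)
print("interval width when d = k:", float(np.max(upper_sq - lower_sq)))
```

The bounds follow from splitting each unit direction into its component inside the teacher span ($u_i$) and the orthogonal remainder of squared norm $1-\|u_i\|^2$, then applying the Cauchy-Schwarz inequality to the remainders.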

Theorems & Definitions (34)

Proposition 4.1
Theorem 4.2
Theorem 5.1
Corollary 5.2
Theorem 5.3
Remark 5.4
Theorem 5.5
Remark D.1
Theorem E.1
proof
...and 24 more
