Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression

Joseph Shenouda; Rahul Parhi; Kangwook Lee; Robert D. Nowak

TL;DR

The paper develops a vector-valued variation space framework (VV spaces) for multi-output neural networks, tying weight decay to a multi-task regularizer and proving a representer theorem that reduces infinite-dimensional learning to finite-width vector-valued networks. It shows that weight decay induces neuron sharing across outputs and derives data-dependent width bounds that scale with intrinsic representation ranks, enabling principled DNN compression via convex multi-task Lasso. The results yield a dimension-agnostic approximation property, a deep-network extension of the representer theorem, and practical compression procedures validated on standard architectures. Collectively, the work provides a rigorous connection between regularization, inductive bias, and efficient architectures for multi-task learning with neural networks, with concrete algorithms for compressing pre-trained models without sacrificing performance.

Abstract

This paper introduces a novel theoretical framework for the analysis of vector-valued neural networks through the development of vector-valued variation spaces, a new class of reproducing kernel Banach spaces. These spaces emerge from studying the regularization effect of weight decay in training networks with activations like the rectified linear unit (ReLU). This framework offers a deeper understanding of multi-output networks and their function-space characteristics. A key contribution of this work is the development of a representer theorem for the vector-valued variation spaces. This representer theorem establishes that shallow vector-valued neural networks are the solutions to data-fitting problems over these infinite-dimensional spaces, where the network widths are bounded by the square of the number of training data. This observation reveals that the norm associated with these vector-valued variation spaces encourages the learning of features that are useful for multiple tasks, shedding new light on multi-task learning with neural networks. Finally, this paper develops a connection between weight-decay regularization and the multi-task lasso problem. This connection leads to novel bounds for layer widths in deep networks that depend on the intrinsic dimensions of the training data representations. This insight not only deepens the understanding of the deep network architectural requirements, but also yields a simple convex optimization method for deep neural network compression. The performance of this compression procedure is evaluated on various architectures.

Paper Structure (32 sections, 10 theorems, 102 equations, 5 figures, 1 table)

Introduction
Organization and Main Contributions
Weight Decay and the Neural Balance Theorem
Vector-Valued Variation Spaces
Scalar-Valued Variation Spaces
Vector-Valued Variation Spaces
The Curse of Dimensionality
Representer Theorem for Vector-Valued Variation Spaces
A Representer Theorem for Deep Neural Networks
Neuron Sharing in Neural Network Solutions
Data-Dependent Width Bounds and DNN Compression
Width Bounds
Sparsity of Solutions to the Multi-Task Lasso Problem
Experiments
Neuron Sharing Simulation
...and 17 more sections

Key Result

Theorem 1

Let $f_{\bm{\theta}}$ be a DNN of any architecture such that ${\bm{\theta}}$ minimizes the general weight-decay objective. Then the weights satisfy the following balance constraint: if ${\bm{w}}$ and ${\bm{v}}$ denote the input and output weights of any neuron with a homogeneous activation function, then …
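
The following is a minimal sketch of the rescaling argument commonly used for balance results of this kind. It assumes a single ReLU neuron without a bias and a weight-decay cost equal to half the sum of squared weights; the sizes, the seed, and the helper names `neuron` and `cost` are hypothetical and not taken from the paper. It shows that rescaling $({\bm{w}}, {\bm{v}}) \mapsto (a{\bm{w}}, {\bm{v}}/a)$ with $a > 0$ leaves the neuron's output unchanged, and that the cost along this family is smallest when the two norms coincide.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def neuron(x, w, v):
    # One ReLU neuron: x -> v * relu(w^T x), with input weights w and output weights v.
    return v * relu(w @ x)

rng = np.random.default_rng(0)
w = rng.normal(size=4)   # input weights (hypothetical size)
v = rng.normal(size=3)   # output weights (hypothetical size)
x = rng.normal(size=4)   # an arbitrary input

# ReLU is positively homogeneous, so (a*w, v/a) computes the same function for a > 0.
a = 2.5
same_output = np.allclose(neuron(x, w, v), neuron(x, a * w, v / a))

def cost(a):
    # Assumed weight-decay cost of the rescaled neuron: 0.5 * (||a w||^2 + ||v / a||^2).
    return 0.5 * (np.sum((a * w) ** 2) + np.sum((v / a) ** 2))

# Setting the derivative of cost(a) to zero gives a^4 = ||v||^2 / ||w||^2.
a_star = np.sqrt(np.linalg.norm(v) / np.linalg.norm(w))
w_bal, v_bal = a_star * w, v / a_star

print("same function after rescaling:", same_output)
print("balanced norms:", np.linalg.norm(w_bal), np.linalg.norm(v_bal))
print("cost at a=1 vs. at a_star:", cost(1.0), cost(a_star))
```

At the minimizing scale the cost equals the product of the two norms, which is one way to see how weight decay on a homogeneous neuron can act like a penalty on that product.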

Figures (5)

Figure 1: Three neural networks with different weight-sparsity patterns. The input weights are normalized to lie on the sphere and the components of the output weights are all $O(1)$. In the case of homogeneous activation functions, weight decay minimizes the $\mathcal{V}_{\sigma}(\mathbb{R}^d; \mathbb{R}^D)$ norm and therefore favors the right-most architecture. This architecture exhibits both neuron sparsity and neuron sharing. Each output depends on the same few neurons. This observation also gives insight into the regularity of the optimal functions: They favor functions that only vary in a few directions across all outputs. This is in contrast with the middle network where each output has variation in a small number of directions, but this set of directions can be different for each output.
Figure 2: We trained a three-output, two-dimensional ReLU neural network of the form ${\bm{f}}({\bm{x}})=\sum_{k=1}^{K} {\bm{v}}_k \sigma({\bm{w}}^{T}_{k}{\bm{x}}+b_k)$ with weight decay, with $\ell^1$-regularization, and with no regularization. Let $f_1$, $f_2$, and $f_3$ denote the first, second, and third components of the output. We plot the location of each active neuron under the $(\theta_k, b_k)$-parameterization. The size of each circle indicates the magnitude of the corresponding output weight vector. With weight decay, very few neurons are active, and those that remain are shared across all outputs.
Figure 3: Distribution of the number of active columns for the solutions to the multi-task lasso problem on randomly generated matrices of varying sizes. The horizontal axis is the number of nonzero columns in the optimal $\mathbf{V}$ and the vertical axis is the frequency. We ran this experiment for 100 randomly generated matrices. In all cases $r_{\mathbf{\Phi}}=N$ and $r_{\mathbf{\Psi}}=D$, so by the multi-task lasso bound we expect $N \leq \widehat{K}\leq ND$. The shaded region indicates our theoretical bounds. The wide gap suggests that our upper bound can be sharpened.
Figure 4: Distribution of the number of active columns for the solutions to the multi-task lasso problem on randomly generated matrices with $\mathbf{\Phi}$ of various ranks. The horizontal axis is the number of nonzero columns in the optimal $\mathbf{V}$ and the vertical axis is the frequency. We ran this experiment for 100 randomly generated matrices; in all cases $D=10$, $N=20$, and $K=200$. By the multi-task lasso bound we expect $r_{\mathbf{\Phi}} \leq \widehat{K} \leq r_{\mathbf{\Phi}} \cdot r_{\mathbf{\Psi}}$. The shaded region indicates our theoretical bounds.
Figure 5: Distribution of the number of nonzero columns for the solutions to the multi-task lasso problem. We ran this experiment over 1000 randomly generated matrices $\mathbf{\Phi}$ and $\mathbf{\Psi}$. The horizontal axis is the number of nonzero columns in the optimal $\mathbf{V}$, where each bin is left-inclusive and corresponds to a single integer. The vertical axis is the frequency. We see that the sparsest solution depends on the data and can vary between our bounds. The area between the two shaded regions indicates our theoretical bounds. The gap indicates that our upper bound can be sharpened.
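
The column-counting experiments in Figures 3 to 5 can be approximated with a short sketch. It assumes a penalized form of the multi-task lasso, $\min_{\mathbf{V}} \tfrac{1}{2}\|\mathbf{V}\mathbf{\Phi}-\mathbf{\Psi}\|_F^2 + \lambda\sum_k \|{\bm{v}}_k\|_2$, solved by proximal gradient descent with column-wise soft-thresholding. This may differ from the exact formulation and solver used in the paper. The value of `lam`, the iteration count, the Gaussian data, and the function names are all hypothetical; only the sizes $D=10$, $N=20$, $K=200$ follow the Figure 4 caption.

```python
import numpy as np

def group_soft_threshold(V, tau):
    # Proximal operator of tau * sum_k ||v_k||_2: shrink each column's norm by tau,
    # setting columns with norm <= tau exactly to zero.
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return V * scale

def objective(V, Phi, Psi, lam):
    fit = 0.5 * np.sum((V @ Phi - Psi) ** 2)
    penalty = lam * np.sum(np.linalg.norm(V, axis=0))
    return fit + penalty

def multitask_lasso(Phi, Psi, lam, n_iter=2000):
    # Proximal gradient descent on the assumed penalized objective.
    D, K = Psi.shape[0], Phi.shape[0]
    V = np.zeros((D, K))
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = (V @ Phi - Psi) @ Phi.T
        V = group_soft_threshold(V - step * grad, step * lam)
    return V

rng = np.random.default_rng(0)
D, N, K = 10, 20, 200                  # sizes from the Figure 4 caption
Phi = rng.normal(size=(K, N))          # stand-in for the K x N feature matrix
V_dense = rng.normal(size=(D, K))      # a dense set of output weights
Psi = V_dense @ Phi                    # targets reproduced by the dense weights

lam = 1.0                              # hypothetical regularization strength
V_hat = multitask_lasso(Phi, Psi, lam)

# A column of V_hat that is exactly zero corresponds to a neuron that can be dropped.
active = int(np.sum(np.linalg.norm(V_hat, axis=0) > 0))
print("nonzero columns:", active, "of", K)
print("rank of Phi:", np.linalg.matrix_rank(Phi), "rank of Psi:", np.linalg.matrix_rank(Psi))
```

Because the penalty is only an approximation of an exact-fit constraint, the count printed here need not fall inside the shaded bounds of the figures; decreasing `lam` and running more iterations moves the solution closer to one that fits $\mathbf{\Psi}$ exactly.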

Theorems & Definitions (16)

Theorem 1: Neural Balance Theorem
Remark 2
Theorem 3
Remark 4
Theorem 5
Corollary 6
Remark 7
Theorem 8
Theorem 9
Theorem 10
...and 6 more
