Koopman-based generalization bound: New aspect for full-rank weights

Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Atsushi Nitanda, Taiji Suzuki

TL;DR

A new generalization bound for neural networks based on Koopman operators is proposed. It is tighter than existing norm-based bounds when the condition numbers of the weight matrices are small, and it is completely independent of the network width if the weight matrices are orthogonal.

Abstract

We propose a new generalization bound for neural networks based on Koopman operators. Whereas most existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of the weight matrices are small. In particular, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict the existing bounds but complements them. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with the existing bounds to obtain a tighter bound. Our result sheds new light on the generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and the generalization of neural networks.

Paper Structure

This paper contains 37 sections, 11 theorems, 61 equations, 5 figures, 1 table.

Key Result

Proposition 2

Let $p(\omega)=1/(1+\Vert \omega\Vert^2)^s$ for $\omega\in\mathbb{R}^{d}$, $s\in\mathbb{N}$, and $s>d/2$. If the activation function $\sigma$ has the following properties, then $K_{\sigma}:H_p(\mathbb{R}^{d})\to H_p(\mathbb{R}^{d})$ is bounded.

Figures (5)

  • Figure 1: (a) Scatter plot of the generalization error versus our bound (for 5 independent runs). Points are colored darker as the epochs proceed. (b) Test accuracy with and without the regularization based on our bound. (c) The condition number $r_{d,j}=\eta_{1,j}/\eta_{d,j}$ of the weight matrix for layers $j=2,\ldots,4$.
  • Figure 2: Behavior of the value $\vert \cos(\theta)\vert$. Here, $\theta$ is the maximum value of the angles between the output of the second layer and the directions of singular vectors of $W_3$ associated with the singular values that are larger than $0.1$.
  • Figure 3: The ratio $r_{d,j}=\eta_{1,j}/\eta_{d,j}$ of singular values (condition number) of weight matrices for layers $j=1,2,4$. (Right) Without regularization. (Left) With the regularization based on our bound.
  • Figure 4: Test accuracy of AlexNet trained on CIFAR-10 with and without regularization.
  • Figure 5: Test and train loss of AlexNet trained on CIFAR-10 with and without regularization. (Right) Test loss. (Left) Train loss.
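
The condition number $r_{d,j}=\eta_{1,j}/\eta_{d,j}$ that appears in the captions for Figures 1 and 3 can be computed from the singular values of a weight matrix. The following is a minimal sketch, assuming a square weight matrix and NumPy. The function name `condition_number`, the dimension `d = 8`, and the random test matrices are illustrative choices and do not come from the paper. It only illustrates why an orthogonal matrix has a condition number of 1, the case in which the abstract says the bound is independent of the width. It does not reproduce the paper's bound or regularizer.

```python
import numpy as np


def condition_number(weight: np.ndarray) -> float:
    """Return eta_1 / eta_d, the ratio of the largest to the smallest singular value.

    Hypothetical helper for illustration; assumes `weight` is a full-rank square matrix.
    """
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(singular_values[0] / singular_values[-1])


rng = np.random.default_rng(0)
d = 8  # arbitrary width chosen for the example

# A generic full-rank matrix: its condition number is at least 1.
gaussian = rng.standard_normal((d, d))

# The Q factor of a QR decomposition is orthogonal, so all its singular values equal 1.
orthogonal, _ = np.linalg.qr(gaussian)

print("orthogonal:", condition_number(orthogonal))
print("gaussian:  ", condition_number(gaussian))
```

For the orthogonal matrix the printed ratio is 1 up to floating-point error, while the Gaussian matrix gives a larger value.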

Theorems & Definitions (26)

  • Example 1
  • Remark 1
  • Proposition 2
  • Example 2
  • Remark 3
  • Theorem 4: First Main Theorem
  • Lemma 5
  • Proposition 6
  • Remark 7
  • Remark 8
  • ...and 16 more