Improve Generalization Ability of Deep Wide Residual Network with A Suitable Scaling Factor

Songtao Tian, Zixiong Yu

TL;DR

It is shown that if the scaling factor $\alpha$ on the residual branch is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable as the depth goes to infinity.

Abstract

Deep Residual Neural Networks (ResNets) have demonstrated remarkable success across a wide range of real-world applications. In this paper, we identify a suitable scaling factor (denoted by $\alpha$) on the residual branch of deep wide ResNets to achieve good generalization ability. We show that if $\alpha$ is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable as the depth goes to infinity. We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur. However, when $\alpha$ decreases rapidly with $L$, kernel regression with the deep RNTK and early stopping can achieve the minimax rate, provided that the target regression function falls in the reproducing kernel Hilbert space associated with the infinite-depth RNTK. Our simulation studies on synthetic data and on real classification tasks such as MNIST, CIFAR10 and CIFAR100 support our theoretical criteria for choosing $\alpha$.
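To make the role of the scaling factor concrete, here is a minimal NumPy sketch of a fully connected ResNet whose residual branch is multiplied by $\alpha$. The exact parametrization used in the paper is not given in this summary, so the block structure, the $1/\sqrt{\text{fan-in}}$ normalization, and the helper names `init_params`, `resnet_forward`, and `alpha_schedule` (with its exponent `gamma`) are illustrative assumptions, not the authors' definitions.

```python
import numpy as np


def alpha_schedule(depth, gamma):
    """Hypothetical schedule alpha = depth**(-gamma); gamma = 0 gives a constant alpha = 1."""
    return float(depth) ** (-gamma)


def init_params(rng, d, width, depth):
    """Standard Gaussian weights: input layer A, `depth` residual blocks (W, V), readout v."""
    return {
        "A": rng.standard_normal((width, d)),
        "blocks": [
            (rng.standard_normal((width, width)), rng.standard_normal((width, width)))
            for _ in range(depth)
        ],
        "v": rng.standard_normal(width),
    }


def resnet_forward(params, x, alpha):
    """Forward pass h <- h + alpha * branch(h); the normalization is an assumed convention."""
    width = params["A"].shape[0]
    h = params["A"] @ x  # input embedding of a point x on the unit sphere
    for W, V in params["blocks"]:
        hidden = np.maximum(W @ h / np.sqrt(width), 0.0)  # ReLU pre-activation, scaled by fan-in
        branch = np.sqrt(2.0 / width) * (V @ hidden)      # residual branch output
        h = h + alpha * branch                            # skip connection plus scaled branch
    return params["v"] @ h / np.sqrt(width)               # scalar readout


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth = 32
    params = init_params(rng, d=3, width=256, depth=depth)
    x = rng.standard_normal(3)
    x /= np.linalg.norm(x)  # place the input on the unit sphere S^2
    for gamma in (0.0, 0.5, 1.0):
        alpha = alpha_schedule(depth, gamma)
        print(f"gamma={gamma}: alpha={alpha:.4f}, output={resnet_forward(params, x, alpha):.4f}")
```

In this sketch, `gamma = 0` corresponds to the constant-$\alpha$ regime discussed in the abstract, while larger `gamma` makes $\alpha$ shrink faster with depth $L$. Which decay rates avoid the degeneration is a result of the paper and is not encoded here.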

Paper Structure (31 sections, 22 theorems, 127 equations, 5 figures)

Key Result

Lemma 1

$r^{(L)}$ is positive definite on $\mathbb S^{d-1}$ when $L\geq2$.

Figures (5)

  • Figure 1: Average output of RNTK for random input $\boldsymbol{x},\boldsymbol{x}'\in \mathrm{Uniform}(\mathbb S^2)$ with increasing $L$
  • Figure 2: Test error for synthetic data from $\mathrm{Uniform}(\mathbb S^2)$ with different $\alpha$
  • Figure 3: Test accuracy for MNIST $10$ with different $\alpha$
  • Figure 4: Test accuracy for CIFAR10 with different $\alpha$
  • Figure 5: Test accuracy of ResNet with different $\alpha$. Left: CIFAR10; Right: CIFAR100

Theorems & Definitions (37)

  • Definition 1
  • Lemma 1
  • Corollary 1
  • Proposition 1
  • Corollary 2: Loss approximation
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Proposition 2: Theorem 4.8 in Belfer et al. (2021)
  • Definition 2: Falling factorial
  • ...and 27 more