Improve Generalization Ability of Deep Wide Residual Network with A Suitable Scaling Factor

Songtao Tian; Zixiong Yu

TL;DR

It is shown that if the scaling factor $\alpha$ is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable as the depth goes to infinity.

Abstract

Deep Residual Neural Networks (ResNets) have demonstrated remarkable success across a wide range of real-world applications. In this paper, we identify a suitable scaling factor (denoted by $\alpha$) on the residual branch of deep wide ResNets to achieve good generalization ability. We show that if $\alpha$ is a constant, the class of functions induced by the Residual Neural Tangent Kernel (RNTK) is asymptotically not learnable as the depth goes to infinity. We also highlight a surprising phenomenon: even if we allow $\alpha$ to decrease with increasing depth $L$, the degeneration phenomenon may still occur. However, when $\alpha$ decreases rapidly with $L$, kernel regression with the deep RNTK and early stopping can achieve the minimax rate, provided that the target regression function falls in the reproducing kernel Hilbert space associated with the infinite-depth RNTK. Our simulation studies on synthetic data and real classification tasks such as MNIST, CIFAR10 and CIFAR100 support our theoretical criteria for choosing $\alpha$.
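To make the role of the scaling factor concrete, the following is a minimal sketch of a residual stack in which each branch is multiplied by $\alpha = L^{-\gamma}$. It is an illustration under stated assumptions, not the architecture analysed in the paper: the two-matrix ReLU branch, the Gaussian initialization, the width of 32, and the helper names `residual_scale`, `gaussian_matrix`, `matvec` and `hidden_norm` are all hypothetical choices made for this example. It only shows how $\alpha$ enters the forward pass and how the hidden-state norm behaves at initialization; it says nothing by itself about learnability.

```python
import math
import random


def residual_scale(depth, gamma):
    """Scaling factor alpha = L^(-gamma); gamma = 0 gives the constant alpha = 1."""
    return depth ** (-gamma)


def gaussian_matrix(rows, cols, std, rng):
    """Matrix with i.i.d. N(0, std^2) entries (assumed initialization)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def hidden_norm(x, depth, gamma, width=32, seed=0):
    """Euclidean norm of the hidden state after `depth` scaled residual blocks."""
    rng = random.Random(seed)
    alpha = residual_scale(depth, gamma)
    # Input layer: project x to the hidden width.
    h = matvec(gaussian_matrix(width, len(x), math.sqrt(1.0 / width), rng), x)
    for _ in range(depth):
        w = gaussian_matrix(width, width, math.sqrt(2.0 / width), rng)
        v = gaussian_matrix(width, width, math.sqrt(1.0 / width), rng)
        # Residual branch: linear map, ReLU, linear map.
        branch = matvec(v, [max(0.0, t) for t in matvec(w, h)])
        # Skip connection plus the branch scaled by alpha.
        h = [a + alpha * b for a, b in zip(h, branch)]
    return math.sqrt(sum(t * t for t in h))


if __name__ == "__main__":
    x = [1.0, 0.0, 0.0]  # a point on the sphere S^2
    for gamma in (0.0, 0.25, 1.0):
        print(gamma, residual_scale(16, gamma), hidden_norm(x, 16, gamma))
```

With a constant $\alpha$ (here $\gamma = 0$) the hidden state in this toy model grows quickly with depth, while a rapidly decaying $\alpha$ keeps each block close to the identity map.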

Paper Structure

This paper contains 31 sections, 22 theorems, 127 equations, and 5 figures.

Introduction
Major contributions
Properties of RNTK
Review of RNTK
Network Architecture and Initialization
Training
Residual neural network kernel (RNK) and residual neural tangent kernel (RNTK)
NTK of ResNet
Positiveness of RNTK
NNK uniformly converges to NTK
Criteria for choosing $\alpha$
Generalization error of deep RNTK for $\alpha=L^{-\gamma}$ with $0\leq\gamma<1/2$
Generalization error of deep RNTK for $\alpha=L^{-\gamma}$ with $\gamma>1/2$
Simulation studies
Fixed kernel
...and 16 more sections

Key Result

Lemma 1

$r^{(L)}$ is positive definite on $\mathbb S^{d-1}$ when $L\geq2$.
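
For orientation, one standard reading of this statement is sketched here. It assumes that $r^{(L)}$ denotes the depth-$L$ RNTK and that "positive definite" is meant in the strict sense; neither is spelled out in this summary, so the paper's own definitions take precedence.

$$
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j\, r^{(L)}(\boldsymbol{x}_i,\boldsymbol{x}_j) > 0
\quad \text{for all } n \geq 1,\ \text{distinct } \boldsymbol{x}_1,\dots,\boldsymbol{x}_n \in \mathbb S^{d-1},\ \text{and } (c_1,\dots,c_n) \in \mathbb R^n \setminus \{\boldsymbol{0}\}.
$$

Equivalently, under this reading every kernel matrix built from distinct points on the sphere has strictly positive eigenvalues.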

Figures (5)

Figure 1: Average output of RNTK for random input $\boldsymbol{x},\boldsymbol{x}'\in \mathrm{Uniform}(\mathbb S^2)$ with increasing $L$
Figure 2: Test error for synthetic data from $\mathrm{Uniform}(\mathbb S^2)$ with different $\alpha$
Figure 3: Test accuracy for MNIST $10$ with different $\alpha$
Figure 4: Test accuracy for CIFAR10 with different $\alpha$
Figure 5: Test accuracy of ResNet with different $\alpha$. Left: CIFAR10; Right: CIFAR100

Theorems & Definitions (37)

Definition 1
Lemma 1
Corollary 1
Proposition 1
Corollary 2: Loss approximation
Theorem 1
Theorem 2
Remark 1
Proposition 2: Theorem 4.8 in Belfer et al. (2021)
Definition 2: Falling factorial
...and 27 more
