Spectrum Extraction and Clipping for Implicitly Linear Layers

Ali Ebrahimpour Boroojeny; Matus Telgarsky; Hari Sundaram

TL;DR

Experiments comparing the proposed algorithms with state-of-the-art methods show that they are more precise and more efficient, and that they lead to better generalization and adversarial robustness.

Abstract

We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. We provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. We study the effect of the batch normalization layers when concatenated with convolutional layers and show how our clipping method can be applied to their composition. By comparing the accuracy and performance of our algorithms to the state-of-the-art methods, using various experiments, we show they are more precise and efficient and lead to better generalization and adversarial robustness. We provide the code for using our methods at https://github.com/Ali-E/FastClip.

Paper Structure (24 sections, 5 theorems, 19 equations, 5 figures, 3 tables, 3 algorithms)

INTRODUCTION
Related Work
METHODS
Notation.
Spectrum Extraction
Clipping the Spectral Norm
Limitations of Convolutional Layers
Batch Normalization Layers
EXPERIMENTS
PowerQR
Clipping Method
Precision and Efficiency
Generalization and Robustness
Clipping Batch Norm
CONCLUSIONS
...and 9 more sections

Key Result

Proposition 2.1

Let $f(x) = Mx + b$. Then the PowerQR algorithm correctly performs the shifted subspace iteration algorithm on $M$, with $\mu$ as the shift value.
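The idea behind this proposition can be illustrated with a small sketch. It is not the authors' PowerQR implementation, only a toy instance of shifted subspace iteration that touches $M$ solely through calls to $f$. It assumes the product $M^\top M x$ is obtained as the gradient of $g(x) = \frac{1}{2}\|f(x) - f(0)\|^2$; central differences, which are exact for a quadratic, stand in here for automatic differentiation so that the example needs only the standard library. The weights `_M` and `_B`, the helper names, and the convention of adding a non-negative shift `mu` are all assumptions made for illustration.

```python
import math

# Hypothetical weights, hidden inside f; the iteration below never reads them.
_M = [[3.0, 0.0], [4.0, 5.0]]
_B = [1.0, -1.0]

def f(x):
    """Affine layer f(x) = M x + b, treated as a black box."""
    return [sum(m * xi for m, xi in zip(row, x)) + b for row, b in zip(_M, _B)]

def gram_apply(x, h=1.0):
    """Return M^T M x as the gradient of g(z) = 0.5 * ||f(z) - f(0)||^2 at x.

    Central differences are exact (up to rounding) because g is quadratic;
    they stand in for automatic differentiation in this sketch.
    """
    f0 = f([0.0] * len(x))
    def g(z):
        return 0.5 * sum((a - c) ** 2 for a, c in zip(f(z), f0))
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((g(plus) - g(minus)) / (2.0 * h))
    return grad

def orthonormalize(cols):
    """Modified Gram-Schmidt: the Q factor of a QR decomposition, by columns."""
    q = []
    for v in cols:
        w = list(v)
        for u in q:
            d = sum(a * b for a, b in zip(u, w))
            w = [a - d * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        q.append([a / norm for a in w])
    return q

def top_singular_values(k, n, iters=100, mu=0.0):
    """Shifted subspace iteration on M^T M + mu * I, using only calls to f."""
    # Deterministic, linearly independent starting columns.
    cols = [[1.0 if i == j else 0.5 for i in range(n)] for j in range(k)]
    for _ in range(iters):
        q = orthonormalize(cols)
        cols = [[a + mu * b for a, b in zip(gram_apply(c), c)] for c in q]
    q = orthonormalize(cols)
    # Rayleigh quotients of M^T M give the squared singular values.
    return [math.sqrt(sum(a * b for a, b in zip(c, gram_apply(c)))) for c in q]

sigmas = top_singular_values(k=2, n=2, mu=1.0)
print(sigmas)  # approximately [6.708, 2.236], i.e. sqrt(45) and sqrt(5)
```

For this toy matrix, $M^\top M = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}$ has eigenvalues $45$ and $5$, so the printed values can be checked by hand. The bias $b$ has no effect on the result because it cancels in $f(x) - f(0)$.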

Figures (5)

Figure 1: Comparison of the clipping methods in a simple network with only one convolutional layer and one dense layer, where the target value is $\pmb{1}$. Our method is the only one that clips this layer correctly for all different settings: 1. Kernel of size $3$ with reflect padding, 2. Kernel of size $3$ with same padding, 3. Kernel of size $3$ and zeros padding with stride of $2$, and 4. Kernel of size $5$ with same replicate padding and stride of $2$.
Figure 2: (a) The first three plots show the clipping of the convolutional layer in a simple two-layer network to various values on MNIST. As the clipping target gets smaller, the spectral norm of the batch norm layer compensates and becomes larger. Meanwhile, the spectral norm of their concatenation slightly decreases. (b) The right-most plot shows the spectral norm of a convolutional layer, its succeeding batch norm layer, and their concatenation from the clipped ResNet-18 model trained on CIFAR-10. Although the convolutional layer is clipped to $1$, the spectral norm of the concatenation is much larger due to the presence of the batch norm layer.
Figure 3: The layer-wise spectral norm of a ResNet-18 model trained on CIFAR-10 (a) and MNIST (b) using each of the clipping methods. (c) The layer-wise spectral norm of a DLA model trained on CIFAR-10 using each of the clipping methods. In each panel, the time column shows the training time per epoch for these methods. As all of the plots show, by using our method, all the layers have a spectral norm very close to the target value $\pmb{1}$. Our method is also much faster than the relatively accurate alternatives and shows a slower increase in running time as the model gets larger.
Figure 4: (a) Each of these three subplots shows the spectral norms of a convolutional layer, its succeeding batch norm layer, and their concatenation in a ResNet-18 model trained on CIFAR-10. The convolutional layers in this model are clipped to $1$. Instead of clipping the batch normalization layer, our method has been applied to the concatenation to control its spectral norm. (b) The rightmost subplot shows the training accuracy for the ResNet-18 model that is trained on CIFAR-10. One curve belongs to the model with the convolutional layers clipped to $1$ using FastClip and the batch norm layers clipped using the direct method used by prior works (FastClip-clip BN). The other two belong to FastClip and FastClip-concat.
Figure 5: (a) The absolute difference between the spectral norm of convolutional layers with different padding types and that of their circulant approximations, for various kernel sizes ($3$, $5$, and $7$) and numbers of channels. The values are computed by averaging over $100$ convolutional filters drawn from a normal distribution for each setting. (b) Comparison of the run-time of the PowerQR algorithm to that of the pipeline used by Virmaux and Scaman (2018) for computing the top-$k$ singular values. We considered a 2d-convolutional layer with $3\times 3$ filters and $32$ input/output channels. The convolution is applied to a $32\times 32$ image.
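Figures 2 and 4 contrast the spectral norm of a batch norm layer with that of its concatenation with the preceding layer. The sketch below shows, on a toy example, why the two can differ. It relies on the standard fact that an inference-mode batch norm layer multiplies each channel by $\gamma / \sqrt{\mathrm{var} + \epsilon}$, so its spectral norm is the largest such multiplier in absolute value. The $2\times 2$ matrix `W` stands in for a layer already clipped to norm $1$; it and all parameter values are hypothetical and are not taken from the paper's experiments.

```python
import math

def bn_scale(gamma, running_var, eps=1e-5):
    """Per-channel multipliers of an inference-mode batch norm layer."""
    return [g / math.sqrt(v + eps) for g, v in zip(gamma, running_var)]

def spectral_norm_2x2(a):
    """Largest singular value of a 2x2 matrix from the eigenvalues of A^T A."""
    t = sum(x * x for row in a for x in row)        # trace(A^T A)
    d = a[0][0] * a[1][1] - a[0][1] * a[1][0]       # det(A)
    return math.sqrt((t + math.sqrt(max(t * t - 4.0 * d * d, 0.0))) / 2.0)

# Hypothetical layer whose spectral norm is already 1.
W = [[1.0, 0.0], [0.0, 0.2]]

# Hypothetical batch norm parameters (eps set to 0 to keep the numbers round).
scale = bn_scale(gamma=[0.5, 4.0], running_var=[1.0, 1.0], eps=0.0)
bn_norm = max(abs(s) for s in scale)

# Concatenation: diag(scale) @ W, i.e. each output row is rescaled.
concat = [[s * w for w in row] for s, row in zip(scale, W)]
concat_norm = spectral_norm_2x2(concat)

print(spectral_norm_2x2(W), bn_norm, concat_norm)
```

Here the batch norm layer alone has norm $4$, while the concatenation has norm about $0.8$, well below the product bound of $4$. Swapping the rows of `W` would instead give a concatenation with norm $4$. The individual norms only bound the norm of the composition, which is one reason to measure and control the concatenation directly rather than each layer separately.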

Theorems & Definitions (11)

Proposition 2.1
Theorem 2.2
Remark 2.3
proof
Lemma A.1
proof
proof
Corollary A.2
Corollary A.3
proof
...and 1 more
