1-Lipschitz Network Initialization for Certifiably Robust Classification Applications: A Decay Problem

Marius F. R. Juston, Ramavarapu S. Sreenivas, William R. Norris, Dustin Nottage, Ahmet Soylemezoglu

TL;DR

The paper addresses initialization challenges for deep $1$-Lipschitz networks built from Almost-Orthogonal Layers (AOL) and SDP-based Lipschitz Layers (SLL) in certifiably robust classification. It derives exact expressions and upper bounds for the parameterized weight variance under Normal and Generalized Normal initializations and shows that the output variance depends only on the layer dimensions, not on the weight variance, so the signal inevitably decays with depth. It extends the analysis to generalized initializations (SGND/PGND), providing characteristic functions (CFs), moment-generating functions (MGFs), and variance bounds, and demonstrates through experiments on Covertype that depth-induced decay harms training, even with bias-based stabilization. The findings highlight fundamental limitations of current initialization schemes for deep Lipschitz networks and motivate future work on backward-gradient stabilization and architectural remedies (e.g., residual connections) to preserve signal for robust certification.

Abstract

This paper discusses the weight parametrization of two standard 1-Lipschitz network architectures, the Almost-Orthogonal-Layers (AOL) and the SDP-based Lipschitz Layers (SLL). It examines their impact on initialization for deep 1-Lipschitz feedforward networks and discusses the underlying issues surrounding this initialization. These networks are mainly used in certifiably robust classification applications to combat adversarial attacks by limiting the impact of perturbations on the classification output. Exact expressions and upper bounds for the parameterized weight variance were calculated assuming a standard Normal distribution initialization; additionally, an upper bound was computed assuming a Generalized Normal Distribution, generalizing the proof to Uniform, Laplace, and Normal distribution weight initializations. It is demonstrated that the weight variance has no bearing on the output variance distribution and that only the dimensions of the weight matrices matter. Additionally, this paper demonstrates that the weight initialization always causes the outputs of deep 1-Lipschitz networks to decay to zero.
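The two claims above (no dependence on the weight variance, and decay with depth) can be illustrated with a small simulation. The sketch below is not the paper's code. It assumes the commonly stated AOL rescaling $P = W D$ with $D_{jj} = \left(\sum_k |W^\top W|_{jk}\right)^{-1/2}$, square bias-free layers, and ReLU activations; the function names, width, depth, and seed are hypothetical choices made for illustration. Under this rescaling, replacing $W$ by $cW$ multiplies $D$ by $1/c$, so $P$ is unchanged, which is one way to see why the initialization scale drops out.

```python
import math
import random


def aol_rescale(W):
    """Return P = W D with D_jj = (sum_k |W^T W|_jk)^(-1/2) (assumed AOL form)."""
    rows, cols = len(W), len(W[0])
    columns = [[W[i][j] for i in range(rows)] for j in range(cols)]
    scales = []
    for j in range(cols):
        # Row j of |W^T W|: absolute inner products of column j with every column.
        row_sum = sum(
            abs(sum(a * b for a, b in zip(columns[j], columns[k])))
            for k in range(cols)
        )
        scales.append(1.0 / math.sqrt(row_sum))
    return [[W[i][j] * scales[j] for j in range(cols)] for i in range(rows)]


def variance(v):
    """Population variance of a list of floats."""
    mean = sum(v) / len(v)
    return sum((x - mean) ** 2 for x in v) / len(v)


def forward_variances(sigma, width=16, depth=12, seed=0):
    """Activation variance after each bias-free AOL + ReLU layer (hypothetical setup)."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = [variance(y)]
    for _ in range(depth):
        # Weights drawn from N(0, sigma^2) before the AOL rescaling is applied.
        W = [[rng.gauss(0.0, sigma) for _ in range(width)] for _ in range(width)]
        P = aol_rescale(W)
        y = [max(0.0, sum(p * x for p, x in zip(row, y))) for row in P]
        variances.append(variance(y))
    return variances


narrow = forward_variances(sigma=0.1)
wide = forward_variances(sigma=10.0)
for layer in (0, 4, 8, 12):
    print(f"layer {layer:2d}: sigma=0.1 -> {narrow[layer]:.3e}, "
          f"sigma=10 -> {wide[layer]:.3e}")
```

With the same seed, the two runs draw the same underlying Gaussian samples scaled by different `sigma`, so the printed variances should agree up to floating-point error while shrinking as the layer index grows. This covers only the AOL case with Normal initialization; the SLL parametrization and the Generalized Normal case are not sketched here.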

Paper Structure

This paper contains 22 sections, 8 theorems, 94 equations, 11 figures.

Key Result

Theorem 1

Given a ReLU activation function $\sigma(\cdot)$, the variance of the linear layer $y_l = \sigma(w_{l - 1} y_{l - 1} + b_{l - 1})$, where $\operatorname{\mathbb{E}}\left[w_{l - 1}\right] = \operatorname{\mathbb{E}}\left[b_{l - 1}\right] = 0$ and $y_{l - 1}$ is an unknown random variable, has the fol...

Figures (11)

  • Figure 1: Transformed Weight Variance Simulation
  • Figure 2: Forward Layer Activation Output
  • Figure 3: Forward Layer Activation Output Variances
  • Figure 4: Forward Layer Activation Output with Bias
  • Figure 5: Forward Layer Activation Output Variances with Bias
  • ...and 6 more figures

Theorems & Definitions (14)

  • Theorem 1
  • proof
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • Theorem 6: MGF of the Absolute PGND
  • proof
  • ...and 4 more