1-Lipschitz Neural Networks are more expressive with N-Activations

Bernd Prach, Christoph H. Lampert

TL;DR

The paper addresses robustness in deep networks by focusing on $1$-Lipschitz constraints and argues that the choice of activation function shapes expressiveness. It establishes that common activations are not universal even in 1D and introduces the $\mathcal{N}$-activation, which is proven universal for $1$-CPWL functions in 1D. The authors give theoretical results showing that 2-piece activations are insufficient, and they show on toy and standard benchmarks that the $\mathcal{N}$-activation achieves universal representation and certified robustness competitive with MaxMin. Empirically, $\mathcal{N}$-activation networks fit target functions better and maintain comparable robust accuracy on CIFAR-10/100 and Tiny ImageNet, given careful initialization and training strategies. A public code release accompanies the work to enable adoption of the proposed activation in practical robust learning tasks.

Abstract

A crucial property for achieving secure, trustworthy and interpretable deep learning systems is their robustness: small changes to a system's inputs should not result in large changes to its outputs. Mathematically, this means one strives for networks with a small Lipschitz constant. Several recent works have focused on how to construct such Lipschitz networks, typically by imposing constraints on the weight matrices. In this work, we study an orthogonal aspect, namely the role of the activation function. We show that commonly used activation functions, such as MaxMin, as well as all piece-wise linear ones with two segments unnecessarily restrict the class of representable functions, even in the simplest one-dimensional setting. We furthermore introduce the new N-activation function that is provably more expressive than currently popular activation functions. We provide code at https://github.com/berndprach/NActivation.
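To make the contrast between activations concrete, here is a minimal pure-Python sketch. The abstract does not spell out the formula for the N-activation, so the three-segment form below (slopes +1, -1, +1 with breakpoints at two parameters `theta1` and `theta2`) is an assumption for illustration, as are the function names `maxmin`, `n_activation`, and `empirical_lipschitz`. The authors' repository is the reference for the actual definition.

```python
def maxmin(pair):
    """MaxMin on one pair of pre-activations: returns (max, min)."""
    a, b = pair
    return (max(a, b), min(a, b))


def n_activation(x, theta1, theta2):
    """Assumed three-segment activation with slopes +1, -1, +1.

    The breakpoints are min(theta1, theta2) and max(theta1, theta2);
    the offsets are chosen so the function is continuous at both.
    """
    lo, hi = min(theta1, theta2), max(theta1, theta2)
    if x <= lo:
        return x - 2.0 * lo
    if x >= hi:
        return x - 2.0 * hi
    return -x


def empirical_lipschitz(f, xs):
    """Largest slope |f(a) - f(b)| / |a - b| over neighbouring grid points."""
    return max(abs(f(a) - f(b)) / abs(a - b) for a, b in zip(xs, xs[1:]))


# Grid check on [-3, 3]: every segment has slope +1 or -1, so the
# largest observed slope should not exceed 1 (up to rounding error).
xs = [i / 100.0 for i in range(-300, 301)]
slope = empirical_lipschitz(lambda x: n_activation(x, -1.0, 1.0), xs)
```

Under this assumed form, the activation has three linear pieces where MaxMin and ReLU-style activations have two per coordinate. The abstract identifies the two-piece structure as a limitation.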

Paper Structure (25 sections, 11 theorems, 27 equations, 6 figures, 5 tables)

This paper contains 25 sections, 11 theorems, 27 equations, 6 figures, 5 tables.

Key Result

Theorem 1

No 2-piece 1-CPWL activation is universal.
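
As a reading aid, the terms in this statement can be unpacked as follows. This assumes that "1-CPWL" abbreviates "continuous, piecewise linear, and 1-Lipschitz", which the listing does not state explicitly. A function $f:\mathbb{R}\to\mathbb{R}$ is $1$-Lipschitz if

$$
|f(x) - f(y)| \le |x - y| \quad \text{for all } x, y \in \mathbb{R}.
$$

Under the same assumption, a 2-piece activation has a single breakpoint $t$ and two slopes $a, b \in [-1, 1]$:

$$
\sigma(x) =
\begin{cases}
a\,(x - t) + c, & x \le t, \\
b\,(x - t) + c, & x > t.
\end{cases}
$$

ReLU ($a = 0$, $b = 1$, $t = c = 0$) and the absolute value ($a = -1$, $b = 1$, $t = c = 0$) are instances of this family.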

Figures (6)

  • Figure 1: A plot of the $\mathcal{N}$-function.
  • Figure 2: A plot of the $\mathcal{N}$-activation with parameters $\theta_1$ and $\theta_2$.
  • Figure 3: Mean squared error on the training set reported for $1$-Lipschitz AOL networks with different activation functions for fitting the $\mathcal{N}$-function.
  • Figure 4: ReLU networks, MaxMin networks and absolute value networks cannot fit the $\mathcal{N}$-function, whereas $\mathcal{N}$-activation networks can!
  • Figure 5: Certified robust accuracy on different datasets, for different $1$-Lipschitz layers. MaxMin and $\mathcal{N}$-activation compared.
  • ...and 1 more figures

Theorems & Definitions (18)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • Corollary 1
  • proof
  • Theorem 2
  • ...and 8 more