Regularisation of Neural Networks by Enforcing Lipschitz Continuity

Henry Gouk, Eibe Frank, Bernhard Pfahringer, Michael J. Cree

TL;DR

The paper addresses the problem of improving generalisation by enforcing Lipschitz continuity in neural networks. It develops a practical framework for computing per-layer Lipschitz bounds for common layer types under $p$-norms, and trains with a hard constraint enforced by a projection step. When each of the $d$ layers is constrained to have a Lipschitz constant of at most $\lambda$, this yields a network-wide bound $L(f) \le \lambda^d$. Key contributions include exact or efficient calculations of per-layer operator norms for $p \in \{1,2,\infty\}$, a projection-based training algorithm, and extensive experiments across CIFAR-10/100, MNIST/Fashion-MNIST, SVHN, SINS-10, and tabular data, which demonstrate data-efficient improvements and give insight into the choice of norm. The results suggest that Lipschitz-constrained networks offer a principled form of regularisation that can improve generalisation, especially with limited data, and open avenues for applying such constraints to GANs and recurrent models.
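The network-wide bound follows from the fact that the Lipschitz constant of a composition is at most the product of the Lipschitz constants of its parts. The short derivation below is a sketch under assumed notation: the layer symbols $\phi_i$ and the decomposition $f = \phi_d \circ \dots \circ \phi_1$ are introduced here for illustration, and a single shared $\lambda$ is assumed for every layer.

```latex
% Assumed notation: f = \phi_d \circ \dots \circ \phi_1, and L(\cdot) is the
% Lipschitz constant with respect to a fixed p-norm.
\begin{align*}
\|f(x_1) - f(x_2)\|_p &\le L(f)\,\|x_1 - x_2\|_p
  && \text{(definition of } L(f)\text{)} \\
L(f) = L(\phi_d \circ \cdots \circ \phi_1) &\le \prod_{i=1}^{d} L(\phi_i)
  && \text{(composition)} \\
L(\phi_i) \le \lambda \ \text{for all } i \quad &\Longrightarrow \quad L(f) \le \lambda^{d}
  && \text{(per-layer constraint)}
\end{align*}
```

The product is only an upper bound. The true Lipschitz constant of the network can be considerably smaller than $\lambda^d$.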

Abstract

We investigate the effect of explicitly enforcing the Lipschitz continuity of neural networks with respect to their inputs. To this end, we provide a simple technique for computing an upper bound on the Lipschitz constant, for multiple $p$-norms, of a feed-forward neural network composed of commonly used layer types. Our technique is then used to formulate training a neural network with a bounded Lipschitz constant as a constrained optimisation problem that can be solved using projected stochastic gradient methods. Our evaluation study shows that the performance of the resulting models exceeds that of models trained with other common regularisers. We also provide evidence that the hyperparameters are intuitive to tune, demonstrate how the choice of norm for computing the Lipschitz constant impacts the resulting model, and show that the performance gains provided by our method are particularly noticeable when only a small amount of training data is available.
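To make the projection idea concrete, the following is a minimal NumPy sketch for a single fully connected layer. It is not the authors' implementation. The function names `operator_norm` and `project` are hypothetical, and the rescaling rule $W \leftarrow W / \max(1, \|W\|_p / \lambda)$ is assumed here as one simple way to map a weight matrix back into the feasible set. The $p = 2$ case uses an exact singular value computation for clarity, where a cheaper iterative estimate would normally be used in practice.

```python
import numpy as np


def operator_norm(W, p):
    """Induced p-norm of W, viewed as the linear map x -> W @ x."""
    if p == 1:
        return np.abs(W).sum(axis=0).max()  # maximum absolute column sum
    if p == np.inf:
        return np.abs(W).sum(axis=1).max()  # maximum absolute row sum
    if p == 2:
        return np.linalg.norm(W, 2)         # largest singular value (exact)
    raise ValueError("p must be 1, 2, or np.inf")


def project(W, lam, p):
    """Rescale W so its operator norm is at most lam (assumed projection rule).

    Matrices that already satisfy the constraint are returned unchanged.
    """
    return W / max(1.0, operator_norm(W, p) / lam)


# Hypothetical usage: apply the projection after each gradient update.
rng = np.random.default_rng(0)
lam = 1.5
layers = [rng.normal(size=(8, 8)) for _ in range(3)]
layers = [project(W, lam, 2) for W in layers]

# Product of per-layer norms, which is at most lam ** len(layers).
bound = np.prod([operator_norm(W, 2) for W in layers])
```

In a training loop, `project` would be called on each weight matrix after every stochastic gradient step, so the constraint holds throughout training and not only at convergence.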

Paper Structure

This paper contains 21 sections, 26 equations, 4 figures, 7 tables, 2 algorithms.

Figures (4)

  • Figure 1: A critical difference diagram showing the statistically significant (95% confidence) differences between the average rank of each method. The number beside each method is the average rank of that method across all datasets. The thick black bars overlaid on groups of thin black lines indicate a clique of methods that have not been found to be statistically significantly different.
  • Figure 2: This figure demonstrates the sensitivity of the algorithm to the choice of $\lambda$ for each of the three $p$-norms when used to regularise VGG19 networks trained on the CIFAR-100 dataset. Because a different hyperparameter was optimised for each layer type, the horizontal axis represents the value of a single constant, $c$, that is used to scale the three different $\lambda$ hyperparameters associated with each curve. Note that when $c=0.6$, the LCC-$\ell_1$ network fails to converge.
  • Figure 3: Learning curves for VGG-style networks trained on CIFAR-10 with each of the regularisation methods.
  • Figure 4: Learning curves for wide residual networks trained on CIFAR-10 with each of the regularisation methods.