Convexified Convolutional Neural Networks

Yuchen Zhang, Percy Liang, Martin J. Wainwright

TL;DR

This work introduces convexified convolutional neural networks (CCNNs) that preserve CNN-style parameter sharing while enabling convex optimization through a nuclear-norm low-rank relaxation and an RKHS-based nonlinear filter representation. For two-layer CCNNs, the authors prove a generalization bound showing the CCNN risk approaches the best possible two-layer CNN risk, with a favorable sample complexity due to sharing. They extend the approach to deeper networks via layer-wise training and demonstrate competitive performance on MNIST variants and CIFAR-10, often surpassing traditional CNNs and several nonconvolutional baselines. The results suggest convex relaxations can yield both scalable training and rigorous generalization guarantees for CNN-like architectures, while identifying open directions for formalizing deep CCNNs.

Abstract

We describe the class of convexified convolutional neural networks (CCNNs), which capture the parameter sharing of convolutional neural networks in a convex manner. By representing the nonlinear convolutional filters as vectors in a reproducing kernel Hilbert space, the CNN parameters can be represented as a low-rank matrix, which can be relaxed to obtain a convex optimization problem. For learning two-layer convolutional neural networks, we prove that the generalization error obtained by a convexified CNN converges to that of the best possible CNN. For learning deeper networks, we train CCNNs in a layer-wise manner. Empirically, CCNNs achieve performance competitive with CNNs trained by backpropagation, SVMs, fully-connected neural networks, stacked denoising auto-encoders, and other baseline methods.
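The abstract's relaxation step replaces a low-rank requirement on the parameter matrix with a convex nuclear-norm constraint. As a minimal sketch of what such a constraint involves, the NumPy snippet below projects a matrix onto a nuclear-norm ball of radius `R` by projecting its singular values onto an L1 ball. The function names, and the choice of a Euclidean projection as the way to enforce the constraint, are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np


def project_singular_values(s, R):
    """Project a non-negative, descending-sorted vector s onto {v >= 0, sum(v) <= R}."""
    if s.sum() <= R:
        return s.copy()
    css = np.cumsum(s)                      # running sums of the sorted values
    k = np.arange(1, len(s) + 1)
    cond = s - (css - R) / k > 0            # indices that stay positive after shifting
    rho = k[cond][-1]                       # largest such index (1-based)
    theta = (css[rho - 1] - R) / rho        # uniform shift that makes the sum equal R
    return np.maximum(s - theta, 0.0)


def project_nuclear_ball(A, R):
    """Euclidean projection of A onto the set {X : ||X||_* <= R} (illustrative helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is returned in descending order
    s_proj = project_singular_values(s, R)
    return (U * s_proj) @ Vt                # rescale columns of U, then recombine
```

Because the shift `theta` zeroes out the smallest singular values first, the projection tends to return matrices of reduced rank, which is one reason a nuclear-norm constraint is commonly used as a convex surrogate for a rank constraint.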

Paper Structure

This paper contains 36 sections, 6 theorems, 60 equations, 5 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Assume that the loss function $\mathcal{L}(\cdot;y)$ is $L$-Lipschitz continuous for every $y\in[d_2]$ and that $\mathcal{K}$ is the inverse polynomial kernel or the Gaussian kernel. For any valid activation function $\sigma$, there is a constant $C_\sigma(B_1)$ such that with the radius $R := C_\sigma \cdots$ […] where $c > 0$ is a universal constant.

Figures (5)

  • Figure 1: The $k^{th}$ output of a CNN $f_k(x) \in \mathbb{R}$ can be expressed as the product between a matrix $Z(x) \in \mathbb{R}^{P \times d_1}$ whose rows are features of the input patches and a rank-$r$ matrix $A_k \in \mathbb{R}^{d_1 \times P}$, which is made up of the filter weights $\{w_j\}$ and coefficients $\{a_{k,j,p}\}$, as illustrated. Due to the parameter sharing intrinsic to CNNs, the matrix $A_k$ inherits a low rank structure, which can be enforced via convex relaxation using the nuclear norm.
  • Figure 2: Comparing different activation functions. The two functions in (a) are quite similar. The smooth hinge loss in (b) is a smoothed version of ReLU.
  • Figure 3: Some variations of the MNIST dataset: (a) random background inserted into the digit image; (b) digits rotated by a random angle generated uniformly between $0$ and $2\pi$; (c) black and white image used as the background for the digit image; (d) combination of background perturbation and rotation perturbation.
  • Figure 4: Classification error on the CIFAR-10 dataset. The best performance within each block is bolded.
  • Figure 5: The convergence of CNN-3 and CCNN-3 on the CIFAR-10 dataset.
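The factorization described in the Figure 1 caption can be checked numerically. The sketch below assumes the rows of $Z(x)$ are already the patch features and that each output depends linearly on them, so that $f_k(x) = \operatorname{tr}(Z(x) A_k)$ with $A_k = \sum_j w_j a_{k,j}^\top$. The sizes `P`, `d1`, `r` and the random data are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d1, r = 6, 5, 2                      # patches, feature dimension, number of filters

Z = rng.standard_normal((P, d1))        # Z(x): one row of features per input patch
W = rng.standard_normal((r, d1))        # filter weights w_j, one row per filter
a_k = rng.standard_normal((r, P))       # coefficients a_{k,j,p} for a single output k

# Direct evaluation: sum over filters j and patches p of a_{k,j,p} * <w_j, z_p(x)>.
f_direct = sum(a_k[j, p] * (W[j] @ Z[p]) for j in range(r) for p in range(P))

# Matrix form: A_k = sum_j w_j a_{k,j}^T has shape (d1, P) and rank at most r.
A_k = W.T @ a_k
f_trace = np.trace(Z @ A_k)

rank_A_k = np.linalg.matrix_rank(A_k)
print(np.isclose(f_direct, f_trace), rank_A_k <= r)
```

Because the same `r` filters are shared across all `P` patches, `A_k` is a sum of `r` outer products, which is the low-rank structure the caption attributes to parameter sharing.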

Theorems & Definitions (8)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4
  • Lemma 5