Deep Learning using Linear Support Vector Machines
Yichuan Tang
TL;DR
The paper questions the default use of softmax with cross-entropy loss in deep classifiers and proposes replacing it with a linear L2-SVM top layer that is trained end-to-end via backpropagation. On MNIST, CIFAR-10, and a facial expression recognition task, the SVM top layer (DLSVM) yields consistent accuracy gains over softmax: for example, test error is 0.87% vs 0.99% on MNIST and 11.9% vs 14.0% on CIFAR-10 (SVM vs softmax), with competitive scores on facial expression recognition. The authors also analyze whether the gains come from regularization or from optimization, and the analysis indicates that the margin-based objective, rather than optimization tweaks alone, plays the key role. The results suggest that switching to an SVM top layer is a simple, effective alternative for discriminative deep models.
Abstract
Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on the popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.
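To make the idea of a margin-based top layer concrete, the following is a minimal NumPy sketch of a one-vs-rest L2-SVM (squared hinge) loss applied to penultimate-layer activations, together with the gradients that backpropagation would pass to the lower layers. It is an illustration under stated assumptions, not the author's implementation: the function name `l2svm_loss_and_grads`, the argument names, the one-vs-rest target encoding, the penalty weight `C`, and the choice to leave the bias unpenalized are all assumptions made for this example.

```python
import numpy as np


def l2svm_loss_and_grads(h, y, W, b, C=1.0):
    """One-vs-rest L2-SVM (squared hinge) loss on top-layer features.

    h: (N, D) penultimate-layer activations
    y: (N,)   integer class labels in [0, K)
    W: (D, K) top-layer weights, b: (K,) biases
    C: weight on the data term (assumed hyperparameter)
    """
    n, k = h.shape[0], W.shape[1]

    # One-vs-rest targets in {-1, +1}: +1 for the true class, -1 otherwise.
    t = -np.ones((n, k))
    t[np.arange(n), y] = 1.0

    scores = h @ W + b                           # (N, K) linear outputs
    margin = np.maximum(0.0, 1.0 - t * scores)   # hinge term per class

    # 0.5 * ||W||^2 + C * sum of squared hinge terms (bias not penalized here).
    loss = 0.5 * np.sum(W * W) + C * np.sum(margin ** 2)

    # Gradient of the loss with respect to the linear outputs.
    dscores = -2.0 * C * t * margin

    dW = W + h.T @ dscores       # includes the weight-decay term
    db = dscores.sum(axis=0)
    dh = dscores @ W.T           # passed down to the lower layers
    return loss, dW, db, dh


def predict(h, W, b):
    """Predict the class whose linear output is largest."""
    return np.argmax(h @ W + b, axis=1)
```

Unlike the standard (L1) hinge loss, the squared hinge is differentiable everywhere, so `dh` can be fed into an ordinary backpropagation pass in place of the softmax cross-entropy gradient.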
