Multi-Objective Linear Ensembles for Robust and Sparse Training of Few-Bit Neural Networks

Ambrogio Maria Bernardelli, Stefano Gualandi, Hoong Chuin Lau, Simone Milanesi, Neil Yorke-Smith

TL;DR

A multi-objective ensemble approach that trains a single NN for each possible pair of classes and applies a majority voting scheme to predict the final output. It yields robust, sparsified networks whose output is not affected by small perturbations of the input and whose number of active weights is as small as possible.

Abstract

Training neural networks (NNs) using combinatorial optimization solvers has gained attention in recent years. In low-data settings, state-of-the-art mixed integer linear programming solvers can train a NN exactly, avoiding intensive GPU-based training and hyper-parameter tuning while simultaneously training and sparsifying the network. We study the case of few-bit discrete-valued neural networks, both Binarized Neural Networks (BNNs), whose values are restricted to ±1, and Integer Neural Networks (INNs), whose values lie in a range {-P, ..., P}. Few-bit NNs are receiving increasing recognition due to their lightweight architecture and ability to run on low-power devices. This paper proposes new methods to improve the training of BNNs and INNs. Our contribution is a multi-objective ensemble approach based on training a single NN for each possible pair of classes and applying a majority voting scheme to predict the final output. Our approach results in robust sparsified networks whose output is not affected by small perturbations of the input and whose number of active weights is as small as possible. We compare this BeMi approach to the current state-of-the-art in solver-based NN training and gradient-based training, focusing on BNN learning in few-shot contexts. We compare the benefits and drawbacks of INNs versus BNNs, shedding new light on the distribution of weights over the {-P, ..., P} interval. Finally, we compare multi-objective versus single-objective training of INNs, showing that robustness and network simplicity can be acquired simultaneously, thus obtaining better test performance. While the previous state-of-the-art approaches achieve an average accuracy of 51.1% on the MNIST dataset, the BeMi ensemble approach achieves an average accuracy of 68.4% when trained with 10 images per class and 81.8% when trained with 40 images per class, with up to 75.3% of NN links removed.
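
The pairwise ensemble with majority voting described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the names `class_pairs`, `majority_vote`, and `make_stub` are hypothetical, each trained pairwise network is replaced by a plain callable, and returning `None` on a tie for the top vote count is an assumption (the paper's own rule is given by its label-status definitions, which are not reproduced here).

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Sequence, Tuple

Label = Hashable
# Assumption: a pairwise classifier maps an input to one of its two labels.
PairClassifier = Callable[[Sequence[float]], Label]


def class_pairs(labels: Iterable[Label]) -> List[Tuple[Label, Label]]:
    """Enumerate every unordered pair of classes (one network per pair)."""
    return list(combinations(sorted(labels), 2))


def majority_vote(
    x: Sequence[float],
    classifiers: Dict[Tuple[Label, Label], PairClassifier],
) -> Optional[Label]:
    """Let every pairwise classifier vote and return the most voted label.

    Assumption: a tie for the highest vote count leaves the input
    unclassified, signalled here by returning None.
    """
    votes = Counter(clf(x) for clf in classifiers.values())
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def make_stub(a: int, b: int) -> PairClassifier:
    """Hypothetical stand-in for a trained network: picks the label closer to x[0]."""
    return lambda x: a if abs(x[0] - a) <= abs(x[0] - b) else b


# Toy usage with 4 classes, hence 6 pairwise classifiers.
ensemble = {(a, b): make_stub(a, b) for a, b in class_pairs(range(4))}
prediction = majority_vote([2.2], ensemble)  # label 2 collects 3 of the 6 votes
```

With 10 classes, as in MNIST, `class_pairs` yields 45 pairs, so an ensemble of this kind would consist of 45 small networks, each trained only on the images of its two classes.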

Paper Structure (38 sections, 18 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The SM+MM+MW model achieves the same accuracy as the SM+MM model, outperforming the SM model, while having the smallest percentage of non-zero weights apart from the SM+MW model, which has an almost-zero percentage of non-zero weights but also a lower accuracy than the models that maximize the margins. The dotted lines represent the average accuracy obtained over 5 instances, while the shaded areas highlight the range between the minimum and maximum values.
  • Figure 2: Comparison of published approaches vs BeMi, in terms of accuracy over the MNIST dataset using few-shot learning with 2, 6, and 10 images per digit.
  • Figure 3: Average accuracy for the BeMi ensemble tested on two architectures, namely $\mathcal{N}_a=[784,4,4,1]$ and $\mathcal{N}_b=[784,10,3,1]$, using $10, 20, 30, 40$ images per class.
  • Figure 4: Comparison of accuracy for different values of $P$. Note how using an exponentially larger search space, namely, using values of $P$ greater than $1$, does not improve the accuracy.
  • Figure 5: Confusion matrix of the BeMi ensemble trained with $40$ images per digit with the architecture $[784, 4, 4, 1]$. We report the percentages for each pair. Note that $n.l.$ indicates the unclassified images.
  • ...and 1 more figures
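
To give a feel for the sizes behind the architectures in Figure 3 and the search space mentioned in Figure 4, the following sketch counts weights and estimates the number of weight assignments. It rests on assumptions not confirmed by the captions: layers are taken to be fully connected, bias terms are ignored, and every weight is assumed to range independently over {-P, ..., P}. The function names are hypothetical.

```python
from math import comb, log2
from typing import Sequence


def num_weights(layers: Sequence[int]) -> int:
    """Count weights between consecutive fully connected layers (no biases)."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


def num_pairwise_networks(num_classes: int) -> int:
    """One network per unordered pair of classes."""
    return comb(num_classes, 2)


def log2_search_space(layers: Sequence[int], p: int) -> float:
    """log2 of (2P + 1) ** n, the number of assignments of n weights in {-P, ..., P}."""
    return num_weights(layers) * log2(2 * p + 1)


n_a = [784, 4, 4, 1]
n_b = [784, 10, 3, 1]

print(num_weights(n_a))           # 784*4 + 4*4 + 4*1 = 3156
print(num_weights(n_b))           # 784*10 + 10*3 + 3*1 = 7873
print(num_pairwise_networks(10))  # 45 networks for the 10 MNIST digits
print(log2_search_space(n_a, 1) < log2_search_space(n_a, 3))  # True
```

Under these assumptions the number of candidate weight assignments grows as (2P + 1) raised to the number of weights, which is the sense in which larger values of $P$ enlarge the search space exponentially.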

Theorems & Definitions (6)

  • Definition 1: Dominant label
  • Definition 2: Label statuses
  • Example 1
  • Example 2
  • Definition 3: Dominant label
  • Definition 4: Label statuses