
One-Bit Quantization and Sparsification for Multiclass Linear Classification with Strong Regularization

Reza Ghane, Danil Akhtiamov, Babak Hassibi

TL;DR

This work studies the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled, and proves that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$.

Abstract

We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $\lambda f(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large $\lambda$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.
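The setting in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes one-hot regression targets, the closed-form ridge solution $W = (X^\top X + \lambda I)^{-1} X^\top Y$ for $f(\cdot) = \|\cdot\|_2^2$, random class means rescaled to a hypothetical norm of 5, and corruption that moves a label to a different class chosen uniformly at random. The values of $d$, $n$, $k$, $c$ and $\sigma$ are borrowed from the caption of Figure 1; the helper names, the test-set size and the grid of $\lambda$ values are illustrative choices.

```python
import numpy as np

def make_gmm(n, d, k, sigma, means, rng):
    """Sample n points from a k-component Gaussian mixture with equal class sizes."""
    labels = np.repeat(np.arange(k), n // k)
    X = means[labels] + sigma * rng.standard_normal((labels.size, d))
    return X, labels

def corrupt_labels(labels, k, c, rng):
    """Move a fraction c of each class's labels to a different class (assumed noise model)."""
    noisy = labels.copy()
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        flip = rng.choice(idx, size=int(round(c * idx.size)), replace=False)
        # An offset in 1..k-1 taken modulo k guarantees the new label differs from j.
        noisy[flip] = (j + rng.integers(1, k, size=flip.size)) % k
    return noisy

def ridge_one_hot(X, labels, k, lam):
    """Closed-form ridge regression onto one-hot labels; returns a d x k weight matrix."""
    Y = np.eye(k)[labels]
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def error_rate(W, X, labels):
    """Fraction of points whose arg-max score disagrees with the true label."""
    return float(np.mean(np.argmax(X @ W, axis=1) != labels))

rng = np.random.default_rng(0)
d, n, k, c, sigma = 750, 500, 5, 0.3, 1.0  # values from the Figure 1 caption

# Assumption: random class means with norm 5; the paper's mean structure may differ.
means = rng.standard_normal((k, d))
means *= 5.0 / np.linalg.norm(means, axis=1, keepdims=True)

X_train, y_clean = make_gmm(n, d, k, sigma, means, rng)
y_train = corrupt_labels(y_clean, k, c, rng)
X_test, y_test = make_gmm(5000, d, k, sigma, means, rng)

# Test error on clean labels for increasing regularization strength.
errors = {}
for lam in (1e-2, 1e0, 1e2, 1e4, 1e6):
    W = ridge_one_hot(X_train, y_train, k, lam)
    errors[lam] = error_rate(W, X_test, y_test)
    print(f"lambda = {lam:8.0e}   test error = {errors[lam]:.3f}")
```

Since the arg-max decision is unchanged when $W$ is rescaled, for very large $\lambda$ the classifier in this sketch is governed by the direction of $X^\top Y$, which is one way to see why the limit $\lambda \to \infty$ remains well defined.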

Paper Structure (33 sections, 21 theorems, 149 equations, 7 figures)


Key Result

Lemma 3.1

Let … where $\tilde{\mu}_\ell = \mu_\ell + \sqrt{\frac{k}{n}}\eta_\ell$ and $\tilde{A} \in \mathbb{R}^{(n - k - 2) \times d}, a, b, \eta_\ell \in \mathbb{R}^d$ are i.i.d. $\mathscr{N}(0, \sigma^2)$ and $s, t$ are scalars defined in (eq:st). Assume that $X$ is distributed according to (eq:X). Then the distrib…

Figures (7)

  • Figure 1: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c =0.3$ and $\sigma = 1$. The prediction underestimates the true error for smaller values of $\lambda$ but, as expected, matches it for larger ones.
  • Figure 2: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c =0.3$ and $\sigma = 1$ for these plots. They illustrate that for these parameters it is possible to sparsify the weights by 15X while keeping the classification error very low.
  • Figure 3: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c =0.3$ and $\sigma = 1$ for these plots. They illustrate that for these parameters it is possible to compress each weight to one bit while keeping the classification error very low.
  • Figure 4: We took $d = 1200$, $n = 300$, $k = 5$, $r = 0.7$, $c =0.2$, $\sigma = 1$ for the first plot and $d = 600$, $n = 300$, $k = 5$, $r = 0.7$, $c =0.2$, $\sigma = 1$ for the second. In both cases, the prediction underestimates the true error for the smaller values of $\lambda$ but matches it for the greater values of $\lambda$, as expected.
  • Figure 5: MNIST dataset. Optimal errors for both classifiers are approximately equal to $0.09$.
  • ...and 2 more figures

Theorems & Definitions (34)

  • Lemma 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Corollary 4.5
  • Theorem 4.6
  • Corollary 4.7
  • Lemma C.1
  • proof
  • ...and 24 more