One-Bit Quantization and Sparsification for Multiclass Linear Classification with Strong Regularization

Reza Ghane; Danil Akhtiamov; Babak Hassibi

TL;DR

This work studies the use of linear regression with an explicit regularization term $\lambda f(w)$ for multiclass classification in the over-parametrized regime where some of the training data is mislabeled, and proves that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$.

Abstract

We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $\lambda f(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large $\lambda$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.
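The setup in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes class means drawn at random and rescaled to norm 5, labels corrupted by moving a fraction $c$ of each class uniformly to the other classes, and the ridge objective $\|XW - Y\|_F^2 + \lambda \|W\|_F^2$ with one-hot targets $Y$. All function names are hypothetical. Since the argmax rule is invariant to scaling $W$, the large-$\lambda$ ridge classifier should agree with the direction $X^\top Y$, which the sketch also evaluates.

```python
import numpy as np

def sample_gmm(n, means, sigma, rng):
    """Draw n points from a Gaussian mixture with (near-)equal class sizes."""
    k, d = means.shape
    labels = rng.permutation(np.arange(n) % k)
    X = means[labels] + sigma * rng.standard_normal((n, d))
    return X, labels

def corrupt_labels(labels, k, c, rng):
    """Move a fraction c of each class to a uniformly chosen other class (assumed noise model)."""
    noisy = labels.copy()
    for cls in range(k):
        idx = np.flatnonzero(labels == cls)
        flip = rng.choice(idx, size=int(round(c * idx.size)), replace=False)
        # A random offset in 1..k-1 guarantees the new label differs from cls.
        noisy[flip] = (cls + rng.integers(1, k, size=flip.size)) % k
    return noisy

def ridge_fit(X, Y, lam):
    """Closed-form minimizer of ||XW - Y||_F^2 + lam * ||W||_F^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def error_rate(W, X, labels):
    """Fraction of points whose argmax score differs from the true label."""
    return float(np.mean(np.argmax(X @ W, axis=1) != labels))

rng = np.random.default_rng(0)
d, n, k, c, sigma = 750, 500, 5, 0.3, 1.0

# Assumption: random class means rescaled to norm 5 (not taken from the paper).
means = rng.standard_normal((k, d))
means *= 5.0 / np.linalg.norm(means, axis=1, keepdims=True)

X, y = sample_gmm(n, means, sigma, rng)
y_noisy = corrupt_labels(y, k, c, rng)
Y = np.eye(k)[y_noisy]                      # one-hot targets built from corrupted labels
X_test, y_test = sample_gmm(2000, means, sigma, rng)

# Test error against the clean labels for increasing regularization strength.
errors = {lam: error_rate(ridge_fit(X, Y, lam), X_test, y_test)
          for lam in (1e-2, 1.0, 1e2, 1e8)}

# Large-lambda limit: W is proportional to X^T Y, and argmax ignores the scale.
limit_error = error_rate(X.T @ Y, X_test, y_test)

for lam, err in errors.items():
    print(f"lambda={lam:g}: test error {err:.3f}")
print(f"X^T Y direction: test error {limit_error:.3f}")
```

The error values depend on the assumed mean norm and the random seed, so they are not expected to reproduce the paper's figures; the sketch is only meant to make the regime concrete.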

Paper Structure (33 sections, 21 theorems, 149 equations, 7 figures)


Introduction
Related works
Setup and preliminaries
The Gaussian mixture model with corruption
Why considering GMMs is not too limiting: Gaussian Universality
CGMT
Classification error for linear classifiers
Overview of approach
Main Results
Assumptions and notation
The optimal regularizer and $\lambda$
The Sandwich Theorem for multiclass linear classification in the large $\lambda$ regime
Main Theorems
Numerical Simulations
$f(\cdot) = \|\cdot\|_2^2$
...and 18 more sections

Key Result

Lemma 3.1

Let ..., where $\tilde{\mu}_\ell = \mu_\ell + \sqrt{\frac{k}{n}}\eta_\ell$ and $\tilde{A} \in \mathbb{R}^{(n - k - 2) \times d}$, $a, b, \eta_\ell \in \mathbb{R}^d$ have i.i.d. $\mathscr{N}(0, \sigma^2)$ entries and $s, t$ are scalars defined in (eq: st). Assume that $X$ is distributed according to (eq:X). Then the distrib

Figures (7)

Figure 1: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c = 0.3$ and $\sigma = 1$. The prediction underestimates the true error for smaller values of $\lambda$ but, as expected, matches it for larger ones.
Figure 2: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c = 0.3$ and $\sigma = 1$ for these plots. They illustrate that for these parameters it is possible to sparsify the weights by a factor of 15 while keeping the classification error very low.
Figure 3: We took $d = 750$, $n = 500$, $k = 5$, $r = 0.8$, $c = 0.3$ and $\sigma = 1$ for these plots. They illustrate that for these parameters it is possible to compress each weight to one bit while keeping the classification error very low.
Figure 4: We took $d = 1200$, $n = 300$, $k = 5$, $r = 0.7$, $c = 0.2$, $\sigma = 1$ for the first plot and $d = 600$, $n = 300$, $k = 5$, $r = 0.7$, $c = 0.2$, $\sigma = 1$ for the second. In both cases, the prediction underestimates the true error for smaller values of $\lambda$ but, as expected, matches it for larger ones.
Figure 5: MNIST dataset. Optimal errors for both classifiers are approximately equal to $0.09$.
...and 2 more figures
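
To make "one bit per weight" and "sparsify by a factor of 15" concrete, the sketch below compresses a weight matrix after the fact. This is a hypothetical stand-in and not the paper's method: the paper obtains one-bit and sparse solutions from the $\|\cdot\|_\infty$ and $\|\cdot\|_1$ regularizers, whereas this code simply takes signs, or keeps the largest-magnitude entries, of the direction $X^\top Y$. The mean norm of 5, the independent label flips with probability $c$, and all function names are assumptions made for illustration.

```python
import numpy as np

def one_bit(W):
    """Keep only the sign of each weight, so every entry is -1 or +1."""
    return np.where(W >= 0, 1.0, -1.0)

def sparsify(W, keep):
    """Zero all but the `keep` largest-magnitude entries in each column."""
    W_sparse = np.zeros_like(W)
    top = np.argsort(-np.abs(W), axis=0)[:keep]    # row indices, shape (keep, k)
    cols = np.arange(W.shape[1])
    W_sparse[top, cols] = W[top, cols]
    return W_sparse

def error_rate(W, X, labels):
    """Fraction of points whose argmax score differs from the true label."""
    return float(np.mean(np.argmax(X @ W, axis=1) != labels))

rng = np.random.default_rng(0)
d, n, k, c, sigma = 750, 500, 5, 0.3, 1.0

# Assumption: random class means rescaled to norm 5 (not taken from the paper).
means = rng.standard_normal((k, d))
means *= 5.0 / np.linalg.norm(means, axis=1, keepdims=True)

# Training data; each label is flipped independently with probability c (a simplification).
y = rng.permutation(np.arange(n) % k)
X = means[y] + sigma * rng.standard_normal((n, d))
flip = rng.random(n) < c
y_noisy = np.where(flip, (y + rng.integers(1, k, size=n)) % k, y)

# Uncompressed weights: the direction X^T Y with one-hot corrupted targets.
W = X.T @ np.eye(k)[y_noisy]
W_1bit = one_bit(W)
W_sparse = sparsify(W, keep=d // 15)

# Clean test data from the same mixture.
y_test = rng.integers(0, k, size=2000)
X_test = means[y_test] + sigma * rng.standard_normal((2000, d))

err_full = error_rate(W, X_test, y_test)
err_1bit = error_rate(W_1bit, X_test, y_test)
err_sparse = error_rate(W_sparse, X_test, y_test)
print(f"full: {err_full:.3f}  one-bit: {err_1bit:.3f}  sparse: {err_sparse:.3f}")
```

Because this compression is applied post hoc rather than obtained from the regularized problems, the resulting errors need not match those reported in the figures.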

Theorems & Definitions (34)

Lemma 3.1
Theorem 4.1
Theorem 4.2
Theorem 4.3
Theorem 4.4
Corollary 4.5
Theorem 4.6
Corollary 4.7
Lemma C.1
proof
...and 24 more
