Hard labels sampled from sparse targets mislead rotation invariant algorithms

Avrajit Ghosh; Bin Yu; Manfred Warmuth; Peter Bartlett

Abstract

Logistic regression is one of the most common machine learning setups. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e., the number of samples $n$ exceeds the input dimension $d$), soft-labeled examples $(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$ are sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are instead labeled by hard labels $y_i$ sampled from the same conditional distribution $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, rotation-invariant algorithms are provably suboptimal: they incur an excess risk of $\Omega\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation-invariant algorithms with excess risk $O\!\left(\frac{s\log d}{n}\right)$. The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bound uses gradient descent on the weights $u_i, v_i$, where the linear weight $w_i$ is reparameterized as $u_i v_i$.
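The two algorithms contrasted in the abstract can be sketched side by side. The following is a minimal illustration, not the authors' implementation: the function names, step size, step count, data sizes, and the initialization $u_j = 0.1$, $v_j = 0$ are all assumptions chosen for this sketch, and no early stopping is performed.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def logistic_loss(w, X, y):
    """Mean logistic loss log(1 + exp(-y * x.w)); labels y are in {-1, +1}."""
    return sum(-math.log(sigmoid(yi * dot(x, w))) for x, yi in zip(X, y)) / len(X)


def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with respect to w."""
    n, d = len(X), len(w)
    g = [0.0] * d
    for x, yi in zip(X, y):
        coef = -yi * sigmoid(-yi * dot(x, w)) / n
        for j in range(d):
            g[j] += coef * x[j]
    return g


def gd_linear(X, y, steps=200, lr=0.5):
    """Plain GD on w from zero initialization (rotation invariant)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = logistic_grad(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


def gd_spindly(X, y, steps=200, lr=0.5, init=0.1):
    """GD on (u, v) with w_j = u_j * v_j (not rotation invariant).

    Assumed initialization: u_j = init, v_j = 0, so w starts at zero
    and each coordinate can move toward either sign.
    """
    d = len(X[0])
    u, v = [init] * d, [0.0] * d
    for _ in range(steps):
        g = logistic_grad([a * b for a, b in zip(u, v)], X, y)
        # Chain rule: dL/du_j = g_j * v_j and dL/dv_j = g_j * u_j.
        u, v = ([a - lr * gj * b for a, b, gj in zip(u, v, g)],
                [b - lr * gj * a for a, b, gj in zip(u, v, g)])
    return [a * b for a, b in zip(u, v)]


def sample_hard_labels(n, d, w_star, rng):
    """Gaussian inputs; y = +1 with probability sigmoid(x . w_star)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(dot(x, w_star)) else -1 for x in X]
    return X, y


rng = random.Random(0)
w_star = [3.0] + [0.0] * 9  # hypothetical 1-sparse target in d = 10
X, y = sample_hard_labels(50, 10, w_star, rng)
w_lin = gd_linear(X, y)
w_spin = gd_spindly(X, y)
```

The only difference between the two routines is the parameterization: the spindly update scales each coordinate's gradient by the current size of $u_j$ and $v_j$, which ties the update to the coordinate axes and is what breaks rotation invariance.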

Paper Structure (11 sections, 8 theorems, 47 equations, 10 figures)

Introduction
Preliminaries
Lower Bound for Rotation-Invariant Algorithms
Reduction to a Rotationally Symmetrized Observation Model
Excess risk on posterior over uniform sphere:
Spindly Dynamics in Logistic Regression
Upper bound Excess Risk for the Spindly Network
Numerical Experiments
Related works
Conclusion and Open Problems
Acknowledgements

Key Result

Proposition 2.1

If the design matrix $\mathbf{X}:=(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n})^{\top}$ has full column rank and $n>d$, then the empirical soft-label risk $\widehat{\mathcal{L}}_{\mathrm{soft}}(\mathbf{w})$ has the same unique global minimizer $\mathbf{w}^{\star}$ as the population ri
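One way to see why noise-free soft labels pin down $\mathbf{w}^{\star}$ under a full-column-rank design: inverting the logistic link gives the linear equations $\mathbf{x}_i^{\top}\mathbf{w} = \operatorname{logit}(p_i)$, which have a unique solution. The sketch below illustrates this numerically. It is not the paper's proof and does not minimize the soft-label risk itself; the target vector, sample sizes, and helper names are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, d):
            f = M[r][c] / M[c][c]
            for k in range(c, d + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * d
    for c in reversed(range(d)):
        s = sum(M[c][k] * w[k] for k in range(c + 1, d))
        w[c] = (M[c][d] - s) / M[c][c]
    return w


rng = random.Random(0)
n, d = 8, 3                      # over-constrained: n > d
w_star = [1.5, 0.0, -0.5]        # hypothetical target
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Noise-free soft labels p_i = sigmoid(x_i . w_star).
p = [sigmoid(sum(a * b for a, b in zip(x, w_star))) for x in X]

# Invert the link: logit(p_i) = x_i . w_star, a linear system in w.
z = [math.log(pi / (1.0 - pi)) for pi in p]

# Normal equations (X^T X) w = X^T z; X^T X is invertible when X has
# full column rank, which holds almost surely for Gaussian rows.
XtX = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
Xtz = [sum(x[i] * zi for x, zi in zip(X, z)) for i in range(d)]
w_hat = solve(XtX, Xtz)

err = max(abs(a - b) for a, b in zip(w_hat, w_star))
```

With hard labels this inversion is unavailable, since a sampled $y_i \in \{\pm 1\}$ does not reveal $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$.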

Figures (10)

Figure 1: A sigmoided linear neuron (left) and its "spindlified" reparameterization (right). Gradient Descent (GD) on the left network is rotation invariant, whereas GD on the right network is not.
Figure 2: Gradient flow on a single layer is rotation invariant. Rotating the data $\mathbf{x}_{i}$ (with labels unchanged) induces the same rotation on the estimator $\hat{\mathbf{w}}$.
Figure 5: Excess risk $\mathcal{L}(\hat{\mathbf{w}})-\mathcal{L}(\mathbf{w}^{*})$ plots averaged over sample draws of $\mathcal{D}$. The best estimator $\hat{\mathbf{w}}$ was obtained using early stopping for both algorithms.
Figure : (a) Training loss.
Figure : Single-layer
...and 5 more figures
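The rotation invariance described in the Figure 2 caption can be checked numerically for discrete-time gradient descent started at zero. This is a small sketch under assumed settings (dimension 3, a fixed rotation about one axis, 100 steps, and a hypothetical data-generating direction), not a reproduction of the figure.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, x):
    return [sum(m * a for m, a in zip(row, x)) for row in M]


def gd_logistic(X, y, steps=100, lr=0.5):
    """Plain GD on the mean logistic loss, started at w = 0; y in {-1, +1}."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        g = [0.0] * d
        for x, yi in zip(X, y):
            margin = yi * sum(a * b for a, b in zip(x, w))
            coef = -yi * sigmoid(-margin) / n
            for j in range(d):
                g[j] += coef * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


# A fixed rotation of the first two coordinates (orthogonal matrix).
theta = 0.7
R = [[math.cos(theta), -math.sin(theta), 0.0],
     [math.sin(theta), math.cos(theta), 0.0],
     [0.0, 0.0, 1.0]]

rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
# Hard labels sampled from sigmoid(2 * x_1), i.e. a 1-sparse direction.
y = [1 if rng.random() < sigmoid(2.0 * x[0]) else -1 for x in X]

w_hat = gd_logistic(X, y)                               # original data
w_hat_rot = gd_logistic([matvec(R, x) for x in X], y)   # rotated data, same labels

# Rotation invariance: the estimator on rotated data equals R applied to w_hat.
gap = max(abs(a - b) for a, b in zip(matvec(R, w_hat), w_hat_rot))
```

The gap is zero up to floating-point error because each gradient is a linear combination of the inputs, so rotating every $\mathbf{x}_i$ rotates every iterate when the start point is the origin.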

Theorems & Definitions (9)

Proposition 2.1
Proposition 2.2
Definition 3.1
Proposition 3.2
Theorem 3.3
Lemma 3.4
Theorem 3.5
Theorem 4.1
Theorem 5.1
