Hard labels sampled from sparse targets mislead rotation invariant algorithms

Avrajit Ghosh, Bin Yu, Manfred Warmuth, Peter Bartlett

Abstract

One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e., the number of samples $n$ exceeds the input dimension $d$), the examples $(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$ suffice to recover $\mathbf{w}^{\star}$ and hence to achieve the Bayes risk. However, we prove that when the examples are instead labeled by hard labels $y_i$ sampled from the same conditional distribution $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, rotation-invariant algorithms are provably suboptimal: they incur an excess risk of $\Omega\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation-invariant algorithms with excess risk $O\!\left(\frac{s\log d}{n}\right)$. The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bound runs gradient descent on the weights $u_i,v_i$, where the linear weight $w_i$ is reparameterized as $u_iv_i$.
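
The two algorithms named in the abstract can be sketched side by side. The following is a minimal illustration, not the authors' implementation: it runs plain gradient descent on the logistic loss over $\mathbf{w}$, and gradient descent over $(\mathbf{u},\mathbf{v})$ with $w_i=u_iv_i$, on hard labels sampled from a sparse target. The helper names (`make_data`, `gd_linear`, `gd_spindly`), the $\pm 1$ features, the initialization $\mathbf{u}=\alpha\mathbf{1}$, $\mathbf{v}=\mathbf{0}$, the step size, and the fixed step count (standing in for early stopping) are all assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def logistic_loss(w, X, y):
    # Mean of log(1 + exp(-y * <x, w>)) over the sample, labels in {-1, +1}.
    total = 0.0
    for x, yi in zip(X, y):
        m = yi * dot(x, w)
        total += -m if m < -30 else math.log1p(math.exp(-m))
    return total / len(X)

def grad_w(w, X, y):
    # Gradient of the mean logistic loss with respect to the linear weights w.
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        c = -yi * sigmoid(-yi * dot(x, w)) / len(X)
        for j, xj in enumerate(x):
            g[j] += c * xj
    return g

def make_data(n, d, s, seed=0):
    # Hypothetical setup: +/-1 features, s-sparse target, sampled hard labels.
    rng = random.Random(seed)
    w_star = [2.0] * s + [0.0] * (d - s)
    X = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
    y = [1.0 if rng.random() < sigmoid(dot(x, w_star)) else -1.0 for x in X]
    return X, y, w_star

def gd_linear(X, y, lr=0.1, steps=100):
    # Rotation-invariant baseline: gradient descent directly on w, from w = 0.
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = grad_w(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

def gd_spindly(X, y, lr=0.1, steps=100, alpha=0.1):
    # Gradient descent on (u, v) with w_i = u_i * v_i.
    # Assumed init u = alpha, v = 0, so w starts at 0 and can take either sign.
    d = len(X[0])
    u, v = [alpha] * d, [0.0] * d
    for _ in range(steps):
        g = grad_w([a * b for a, b in zip(u, v)], X, y)
        # Chain rule: dL/du_i = g_i * v_i and dL/dv_i = g_i * u_i.
        u, v = ([a - lr * gj * b for a, b, gj in zip(u, v, g)],
                [b - lr * gj * a for a, b, gj in zip(u, v, g)])
    return [a * b for a, b in zip(u, v)]

X, y, w_star = make_data(n=100, d=20, s=2)
w_lin = gd_linear(X, y)
w_spin = gd_spindly(X, y)
```

To compare the two estimators on a given draw, one could inspect `logistic_loss` on fresh samples from `make_data` with a different seed, or the distance of `w_lin` and `w_spin` to `w_star`; the outcome depends on the step count, `alpha`, and the sample.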

Paper Structure (11 sections, 8 theorems, 47 equations, 10 figures)

Key Result

Proposition 2.1

If the design matrix $\mathbf{X}:=(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n})^{\top}$ has full column rank and $n>d$, then the empirical soft-label risk $\widehat{\mathcal{L}}_{\mathrm{soft}}(\mathbf{w})$ has the same unique global minimizer $\mathbf{w}^\star$ as the population ri
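
The soft-label setting of this proposition (noise-free targets $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$, a full-column-rank design, $n>d$) can be illustrated with a small sketch. Note that it does not minimize the empirical soft-label risk as the proposition does; as a simpler stand-in, it inverts the logistic link and solves the resulting linear least-squares problem, which has a unique solution under the same rank condition. The function names, the Gaussian features, and the particular `w_star` are assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    # Solve the square system A z = b by Gaussian elimination with partial pivoting.
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, k))
        z[r] = (M[r][k] - tail) / M[r][r]
    return z

def recover_from_soft_labels(X, p):
    # Invert the link: logit(p_i) = <x_i, w_star> when the soft labels are noise-free.
    logits = [math.log(pi / (1.0 - pi)) for pi in p]
    d = len(X[0])
    # Normal equations X^T X w = X^T logits; X^T X is invertible when X has full column rank.
    A = [[sum(x[a] * x[b] for x in X) for b in range(d)] for a in range(d)]
    rhs = [sum(x[a] * t for x, t in zip(X, logits)) for a in range(d)]
    return solve(A, rhs)

rng = random.Random(0)
n, d = 30, 5                              # over-constrained: n > d
w_star = [1.5, 0.0, 0.0, -1.0, 0.0]       # hypothetical sparse target
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
p = [sigmoid(sum(a * b for a, b in zip(x, w_star))) for x in X]

w_rec = recover_from_soft_labels(X, p)
err = max(abs(a - b) for a, b in zip(w_rec, w_star))
```

Here `err` measures how far the recovered weights are from `w_star`; with exact soft labels the only error source is floating-point round-off. The same shortcut is unavailable with hard labels, since $\pm 1$ labels carry no invertible logit.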

Figures (10)

  • Figure 1: A sigmoided linear neuron (left) and its "spindlified" reparameterization (right). Gradient Descent (GD) on the left network is rotation invariant, whereas GD on the right network is not.
  • Figure 2: Gradient flow on a single layer is rotation invariant. Rotating the data $\mathbf{x}_{i}$ (with labels unchanged) induces the same rotation on the estimator $\hat{\mathbf{w}}$.
  • Figure 5: Excess risk $\mathcal{L}(\hat{\mathbf{w}})-\mathcal{L}(\mathbf{w}^{\star})$, averaged over sample draws of $\mathcal{D}$. The best estimator $\hat{\mathbf{w}}$ for each algorithm was selected by early stopping.
  • Figure : (a) Training loss.
  • Figure : Single-layer
  • ...and 5 more figures
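
The rotation invariance described in the caption of Figure 2 can be checked numerically. The sketch below is an illustration under stated assumptions, using discrete gradient descent in place of gradient flow: it trains on the data $\mathbf{x}_i$ and on rotated data $\mathbf{R}\mathbf{x}_i$ with the same labels, starting from zero, and compares the second estimator with $\mathbf{R}\hat{\mathbf{w}}$. The helper names, the particular rotation (two Givens rotations), and the data sizes are choices made for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def givens(vec, i, j, theta):
    # Rotate vec by angle theta in the coordinate plane (i, j).
    c, s = math.cos(theta), math.sin(theta)
    out = list(vec)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out

def rotate(vec):
    # A fixed rotation R: the composition of two Givens rotations (arbitrary choice).
    return givens(givens(vec, 0, 1, 0.7), 1, 2, 1.1)

def gd_logistic(X, y, lr=0.1, steps=100):
    # Gradient descent on the mean logistic loss, initialized at w = 0.
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    for _ in range(steps):
        g = [0.0] * d
        for x, yi in zip(X, y):
            c = -yi * sigmoid(-yi * dot(x, w)) / n
            for j in range(d):
                g[j] += c * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

rng = random.Random(0)
n, d = 50, 5
w_star = [3.0, 0.0, 0.0, 0.0, 0.0]        # hypothetical 1-sparse target
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [1.0 if rng.random() < sigmoid(dot(x, w_star)) else -1.0 for x in X]

w_hat = gd_logistic(X, y)                            # trained on the original data
w_hat_rot = gd_logistic([rotate(x) for x in X], y)   # trained on the rotated data

# Rotation invariance predicts w_hat_rot = R w_hat, up to floating-point error.
gap = max(abs(a - b) for a, b in zip(rotate(w_hat), w_hat_rot))
```

The gap stays at round-off level because the gradient of the logistic loss is a linear combination of the inputs, so every iterate rotates with the data. Consequently such an algorithm cannot exploit the fact that `w_star` is sparse in the original coordinates but dense after rotation.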

Theorems & Definitions (9)

  • Proposition 2.1
  • Proposition 2.2
  • Definition 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Lemma 3.4
  • Theorem 3.5
  • Theorem 4.1
  • Theorem 5.1