Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise

Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos

TL;DR

This work analyzes the computational limits of multiclass linear classification under Random Classification Noise (RCN) in the distribution-free PAC setting, revealing a sharp hardness jump for $k\ge 3$ labels. By embedding multiclass polynomial classification into linear classification via Veronese lifting and constructing hidden-direction distributions with carefully matched moments, the authors prove super-polynomial Statistical Query (SQ) lower bounds for learning with RCN, even for exact optimal error with $k=3$ and constant separation, and for constant-factor approximation when $k$ grows and the separation shrinks. The reductions hinge on a correlation-testing framework that translates SQ hardness in hypothesis testing into hardness in learning, using moment-matching univariate bases $A_i$ and disjoint-support constructions. The results imply a fundamental gap between the information-theoretic sample complexity and the computational complexity of learning in noisy multiclass settings, motivating exploration of structured noise and marginals in future work.
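The Veronese lifting step can be illustrated with a minimal sketch: a degree-$m$ polynomial in $x$ is a linear function of the vector of all monomials of degree at most $m$, so a classifier of the form $\arg\max_i p_i(x)$ becomes a multiclass linear classifier in the lifted space. The function name `veronese_lift`, the weight matrix `W`, and the three example polynomials are hypothetical choices made for illustration; they are not taken from the paper, whose construction may differ in its details.

```python
import math
from itertools import combinations_with_replacement

import numpy as np


def veronese_lift(x, degree):
    """Map x in R^d to the vector of all monomials of total degree <= degree."""
    x = np.asarray(x, dtype=float)
    feats = []
    for deg in range(degree + 1):
        # Each multiset of coordinate indices of size `deg` is one monomial.
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(math.prod(x[i] for i in idx))  # empty product is 1
    return np.array(feats, dtype=float)


# Hypothetical univariate example with k = 3 polynomials of degree 2.
# Row i holds the coefficients of p_i in the monomial basis (1, t, t^2).
W = np.array([
    [1.0, 0.0, -1.0],  # p_1(t) = 1 - t^2
    [0.0, 1.0, 0.0],   # p_2(t) = t
    [0.0, 0.0, 0.0],   # p_3(t) = 0
])


def poly_classify(t):
    """Polynomial classifier: index (0-based) of the largest p_i(t)."""
    return int(np.argmax([1.0 - t**2, t, 0.0]))


def lifted_linear_classify(t):
    """The same classifier written as a linear classifier on lifted features."""
    return int(np.argmax(W @ veronese_lift([t], 2)))
```

On any input where the largest polynomial value is unique, the two classifiers return the same label, which is the sense in which polynomial classification embeds into linear classification. The lifted dimension is $\binom{d+m}{m}$ for inputs in $\mathbb{R}^d$ and degree $m$.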

Abstract

We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $\mathbb{R}^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma := \min_{i \neq j} (H_{ii}-H_{ij})$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.
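The noise model in the abstract can be sketched in a few lines: a clean label $i$ is replaced by label $j$ with probability $H_{ij}$, so each row of $H$ is a probability distribution, and the separation is the smallest gap between a diagonal entry and an off-diagonal entry in the same row. The helper names `separation` and `apply_rcn`, the 0-based labels, and the example matrix `H_example` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def separation(H):
    """Return min over i != j of H[i, i] - H[i, j]."""
    H = np.asarray(H, dtype=float)
    k = H.shape[0]
    off_diagonal = ~np.eye(k, dtype=bool)
    gaps = np.diag(H)[:, None] - H  # gaps[i, j] = H[i, i] - H[i, j]
    return float(gaps[off_diagonal].min())


def apply_rcn(clean_labels, H, rng):
    """Corrupt 0-based clean labels: label i becomes j with probability H[i, j]."""
    H = np.asarray(H, dtype=float)
    k = H.shape[0]
    return np.array([rng.choice(k, p=H[y]) for y in clean_labels])


# Hypothetical noise matrix for k = 3 (rows sum to 1); not the paper's matrix.
H_example = np.array([
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
])

rng = np.random.default_rng(0)
clean = rng.integers(0, 3, size=1000)   # stand-in for labels of a linear classifier
noisy = apply_rcn(clean, H_example, rng)
```

Here `clean` is drawn uniformly only to keep the sketch short; in the MLC setting it would be the output of a multiclass linear classifier on the examples. With `H_example`, roughly 60% of the noisy labels are expected to agree with the clean ones.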

Paper Structure

This paper contains 25 sections, 23 theorems, 45 equations, 2 figures.

Key Result

Theorem 1.2

There is a noise matrix $H \in [0,1]^{3\times3}$ with $H_{ii}-H_{ij} \ge 0.1, \forall i \neq j \in [3]$, such that it is SQ-hard to learn an MLC problem on $\mathbb{R}^d$, with RCN specified by $H$, up to error $\mathrm{opt}+\epsilon$.
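To make the separation condition in the statement concrete, the following is a hypothetical $3\times 3$ noise matrix that satisfies it. It is chosen only for illustration and is not claimed to be the matrix used in the paper's hardness proof.

$$
H = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix},
\qquad
\begin{aligned}
\text{row 1: } & H_{11}-H_{12} = 0.2, \quad H_{11}-H_{13} = 0.3, \\
\text{row 2: } & H_{22}-H_{21} = 0.4, \quad H_{22}-H_{23} = 0.4, \\
\text{row 3: } & H_{33}-H_{31} = 0.1, \quad H_{33}-H_{32} = 0.1.
\end{aligned}
$$

Each row sums to $1$, and $\min_{i \neq j} (H_{ii}-H_{ij}) = 0.1$, so the condition $H_{ii}-H_{ij} \ge 0.1$ holds for all $i \neq j \in [3]$.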

Figures (2)

  • Figure 1: Illustration of base distributions for $k=3$. Histograms colored in red (resp. blue) correspond to distribution $A_1$ (resp. $A_2$). $p_1,p_2,p_3$, colored in red, blue, and green, are polynomials that characterize the target hypothesis $f^*$. $J_1$ (resp. $J_2$) are red (resp. blue) intervals within the range $(-2\delta,2\delta)$, where examples have ground-truth label $1$ (resp. $2$). Examples outside $J_1 \cup J_2$ have ground-truth label $3$.
  • Figure 2: Illustration of $(D^{A,a}_v)_{\mid y=i}$ for $k=3$. $p_1,p_2,p_3$ colored in red, blue, green are polynomials that characterize the ground truth $f^*$. Histograms in red (resp. blue, green) correspond to distribution $(D^{A,a}_v)_{\mid y=1}$ (resp. $(D^{A,a}_v)_{\mid y=2}$, $(D^{A,a}_v)_{\mid y=3}$). For each $i$, $(D^{A,a}_v)_{\mid y=i}$ has many moments close to the moments of the standard normal.

Theorems & Definitions (45)

  • Definition 1.1: Multiclass Classification with RCN
  • Theorem 1.2: Informal Statement of \ref{th additive}
  • Theorem 1.3: Informal Statement of \ref{cor approximation}
  • Theorem 1.4: Informal Statement of \ref{cor beat constant}
  • Remark 1.5
  • Definition 3.1: SQ Model
  • Definition 3.2: Pairwise Correlation
  • Definition 4.1: SQ-Hard to Distinguish Condition
  • Definition 4.2: Correlation Testing Problem
  • Lemma 4.3
  • ...and 35 more