Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization

Matan Schliserman, Tomer Koren

TL;DR

The paper studies the complexity of learning vector-valued linear predictors with convex Lipschitz losses under a Frobenius-norm constraint, showing a tight ERM sample complexity of $\widetilde{\Theta}(k/\varepsilon^2)$ and revealing a deep connection between vector-valued prediction and stochastic convex optimization. It proves a lower bound via a shattering construction and provides a black-box SCO-to-VVP transformation that embeds any $d$-dimensional SCO instance into a VVP with $k=\Theta(d)$ outputs, translating SCO guarantees into VVP excess risk. This work positions VVP as an interpolation between generalized linear models ($k=1$) and SCO ($k=\Theta(d)$), offering a unified lens and practical reductions between these two dominant learning paradigms. The results advance understanding of when ERM suffices and how optimization frameworks can be transferred across settings, with implications for multi-output prediction and neural network contexts where layers can be viewed as vectors of outputs.

Abstract

We study the problem of learning vector-valued linear predictors: these are prediction rules parameterized by a matrix that maps an $m$-dimensional feature vector to a $k$-dimensional target. We focus on the fundamental case with a convex and Lipschitz loss function, and show several new theoretical results that shed light on the complexity of this problem and its connection to related learning models. First, we give a tight characterization of the sample complexity of Empirical Risk Minimization (ERM) in this setting, establishing that $\widetilde{\Omega}(k/\varepsilon^2)$ examples are necessary for ERM to reach $\varepsilon$ excess (population) risk; this provides an exponential improvement over recent results by Magen and Shamir (2023) in terms of the dependence on the target dimension $k$, and matches a classical upper bound due to Maurer (2016). Second, we present a black-box conversion from general $d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear prediction, showing that any SCO problem can be embedded as a prediction problem with $k=\Theta(d)$ outputs. These results portray the setting of vector-valued linear prediction as bridging between two extensively studied yet disparate learning models: linear models (corresponding to $k=1$) and general $d$-dimensional SCO (with $k=\Theta(d)$).
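To make the setting concrete, the following is a minimal sketch of ERM for a vector-valued linear predictor over a Frobenius-norm ball. It is an illustration under stated assumptions, not the paper's construction or algorithm: the Euclidean-norm loss, the unit radius, the projected subgradient solver, the step-size schedule, and all function names (`matvec`, `loss`, `empirical_risk`, `subgradient`, `project`, `erm`) are hypothetical choices made here for illustration. The paper's results concern how many samples any such empirical minimizer needs, not how it is computed.

```python
import math

def matvec(W, x):
    """Prediction Wx for a k x m matrix W (list of rows) and an m-vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def loss(W, x, y):
    """Assumed loss ||Wx - y||_2: convex and 1-Lipschitz in the prediction Wx."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(matvec(W, x), y)))

def empirical_risk(W, sample):
    """Average loss over a sample of (x, y) pairs."""
    return sum(loss(W, x, y) for x, y in sample) / len(sample)

def subgradient(W, sample):
    """One subgradient of the empirical risk with respect to W."""
    k, m, n = len(W), len(W[0]), len(sample)
    G = [[0.0] * m for _ in range(k)]
    for x, y in sample:
        r = [p - t for p, t in zip(matvec(W, x), y)]
        norm = math.sqrt(sum(v * v for v in r))
        if norm == 0.0:
            continue  # zero is a valid subgradient where the residual vanishes
        for i in range(k):
            for j in range(m):
                G[i][j] += r[i] * x[j] / (norm * n)
    return G

def project(W, W0, radius):
    """Euclidean projection onto the ball {W : ||W - W0||_F <= radius}."""
    diff = [[w - c for w, c in zip(rw, rc)] for rw, rc in zip(W, W0)]
    fro = math.sqrt(sum(v * v for row in diff for v in row))
    if fro <= radius:
        return W
    s = radius / fro
    return [[c + s * d for c, d in zip(rc, rd)] for rc, rd in zip(W0, diff)]

def erm(sample, W0, radius=1.0, steps=200, lr=0.5):
    """Approximate ERM by projected subgradient descent (assumed solver).

    Returns the iterate with the lowest empirical risk seen so far.
    """
    W = [row[:] for row in W0]
    best, best_risk = W, empirical_risk(W, sample)
    for t in range(1, steps + 1):
        G = subgradient(W, sample)
        eta = lr / math.sqrt(t)  # decaying step size, a common default
        stepped = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
        W = project(stepped, W0, radius)
        risk = empirical_risk(W, sample)
        if risk < best_risk:
            best, best_risk = W, risk
    return best
```

Calling `erm(sample, W0)` with `sample` a list of `(x, y)` pairs and `W0` a `k × m` reference matrix returns a matrix inside the ball whose empirical risk is no larger than that of `W0`. With $k=1$ this reduces to a norm-constrained linear model, the endpoint the abstract associates with linear models.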

Paper Structure

This paper contains 18 sections, 10 theorems, 43 equations.

Key Result

Theorem 1

Let $k, n \in \mathbb{N}$. There exist $m=\Theta(n)$, a reference matrix $W_0\in \mathbb{R}^{k\times m}$, a convex and $1$-Lipschitz loss function $\ell: \mathbb{R}^k\to \mathbb{R}$ and a distribution $\mathcal{D}$ such that in the VVP parameterized by $W_0$, $\mathcal{D}$ and $\ell$, with constant p

Theorems & Definitions (18)

  • Theorem 1
  • Lemma 1
  • Proof of \ref{lower_bound_shattering}
  • Theorem 2
  • Lemma 2
  • Proof of \ref{scotopredreduction}
  • Theorem 3
  • Lemma 3 (Maurer, 2016, Corollary 4)
  • Lemma 4
  • Proof
  • ...and 8 more