How to Boost Any Loss Function

Richard Nock, Yishay Mansour

TL;DR

Any loss function can be optimized with boosting. Boosting can thus achieve a feat not yet known to be possible in the classical $0^{th}$ order setting: loss functions are not required to be convex, differentiable or Lipschitz, and in fact are not required to be continuous either.

Abstract

Boosting is a highly successful ML-born optimization setting in which one is required to computationally efficiently learn arbitrarily good models based on access to a weak learner oracle, which provides classifiers performing at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not require access to first order information about a loss, yet the decades-long history of boosting has quickly evolved it into a first order optimization setting, sometimes even wrongfully defining it as such. Owing to recent progress extending gradient-based optimization to use only a loss's zeroth ($0^{th}$) order information to learn, this raises the question: what loss functions can be efficiently optimized with boosting, and what information is really needed for boosting to meet the original boosting blueprint's requirements? We provide a constructive formal answer essentially showing that any loss function can be optimized with boosting. Boosting can thus achieve a feat not yet known to be possible in the classical $0^{th}$ order setting: loss functions are not required to be convex, differentiable or Lipschitz, and in fact are not required to be continuous either. Some tools we use are rooted in quantum calculus, the mathematical field (not to be confused with quantum computation) that studies calculus without passing to the limit, and thus without using first order information.

Paper Structure

This paper contains 31 sections, 12 theorems, 67 equations, 12 figures, 1 table, 3 algorithms.

Key Result

Lemma 4.5

Suppose $F$ is strictly convex and differentiable. Then $\lim_{v\rightarrow 0} S_{F|v}(z'\|z) = D_F(z'\| F'(z))$.
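
The limit in Lemma 4.5 can be checked numerically. The sketch below is only an illustration under stated assumptions, not the authors' code: the definitions are not reproduced in this summary, so it assumes the $v$-derivative is the secant slope $\updelta_v F(z) = (F(z+v) - F(z))/v$ and the Bregman Secant distortion is $S_{F|v}(z'\|z) = F(z') - F(z) - (z'-z)\,\updelta_v F(z)$ (a form consistent with the caption of Figure 1, where the distortion is null at $z' = z$ and $z' = z + v$). The right-hand side is taken to be the usual Bregman divergence $F(z') - F(z) - (z'-z)F'(z)$. All function names are hypothetical.

```python
import math


def v_derivative(F, z, v):
    """Secant slope of F between z and z + v (assumed form of the v-derivative)."""
    return (F(z + v) - F(z)) / v


def bregman_secant(F, z_prime, z, v):
    """Assumed form of the Bregman Secant distortion S_{F|v}(z' || z)."""
    return F(z_prime) - F(z) - (z_prime - z) * v_derivative(F, z, v)


def bregman(F, dF, z_prime, z):
    """Classical Bregman divergence with generator F and derivative dF."""
    return F(z_prime) - F(z) - (z_prime - z) * dF(z)


# Strictly convex, differentiable generator chosen for illustration: F = exp.
F, dF = math.exp, math.exp
z, z_prime = 0.5, 2.0

target = bregman(F, dF, z_prime, z)
offsets = (1.0, 0.1, 0.01, 0.001)
gaps = [abs(bregman_secant(F, z_prime, z, v) - target) for v in offsets]

for v, gap in zip(offsets, gaps):
    print(f"v = {v:<6} |S - D| = {gap:.6f}")

# The distortion vanishes at z' = z + v, as in the Figure 1 caption.
null_at_offset = bregman_secant(F, z + 1.0, z, 1.0)
```

Only evaluations of $F$ enter `bregman_secant`; the derivative `dF` is used solely to compute the limit it is compared against.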

Figures (12)

  • Figure 1: Left: value of $S_{F|v}( z'\|z)$ for convex $F$, $v \stackrel{\mathrm{.}}{=} z_4 - z$ and various $z'$ (colors), for which the Bregman Secant distortion is positive ($z'=z_1$, green), negative ($z'=z_2$, red), minimal ($z'=z_3$) or null ($z'=z_4, z$). Right: depiction of $Q_{F}(z,z+v, z')$ for non-convex $F$ (Definition \ref{defBI}).
  • Figure 2: Simplified depiction of $\overline{W}_{2,t}$ "regimes" (Assumption \ref{assum2cr}). We only plot the components of the $v$-derivative part in \ref{boundW2}: removing index $i$ for readability, we get $\updelta_{\{e_{t},v_{t-1}\}}F(\tilde{e}_{t-1}) = (B_t - A_t) / (y H_t(\bm{x})-y H_{t-1}(\bm{x}))$ with $A_t \stackrel{\mathrm{.}}{=} \updelta_{v_{t-1}}F(y H_{t-1}(\bm{x})) = - w_t$ and $B_t \stackrel{\mathrm{.}}{=} \updelta_{v_{t-1}}F(y H_{t}(\bm{x}))$ ($= - w_{t+1}$ iff $v_{t-1} = v_{t}$). If the loss is "nice" like the exponential or logistic losses, we always have a small $\overline{W}_{2,t}$ (a). Place a bump in the loss (b-d) and there is a risk that $\overline{W}_{2,t}$ becomes too large for the WRA to hold. Workarounds include two strategies: picking small enough offsets (b) or fitting offsets large enough to pass the bump (c). The blue arrow in (d) is discussed in Section \ref{sec:disc}.
  • Figure 3: A simple way to build $\mathbb{I}_{ti}(z)$ for a discontinuous loss $F$ ($\tilde{e}_{ti}<\tilde{e}_{(t-1)i}$ and $z$ are represented), $\mathcal{O}$ being the set of solutions as it is built. We rotate two half-lines, one passing through $(\tilde{e}_{ti},F(\tilde{e}_{ti}))$ (thick line, $(\Delta)$) and a parallel one translated by $-z$ (dashed line) (a). As soon as $(\Delta)$ crosses $F$ on any point $(z',F(z'))$ with $z\neq \tilde{e}_{ti}$ while the dashed line stays below $F$, we obtain a candidate offset $v$ for oo, namely $v = z' - \tilde{e}_{ti}$. In (b), we obtain an interval of values. We keep on rotating $(\Delta)$, eventually producing several intervals for the choice of $v$ if $F$ is not convex (c). Finally, when we reach an angle such that the maximal difference between $(\Delta)$ and $F$ in $[\tilde{e}_{ti},\tilde{e}_{(t-1)i}]$ is $z$ ($z$ can be located at an intersection between $F$ and the dashed line), we stop and obtain the full $\mathbb{I}_{ti}(z)$ (d).
  • Figure 4: More examples of sets $\mathbb{I}_{ti}(z)$ (in blue) for the $F$ in Figure \ref{f-Iti-const}. (a): $\mathbb{I}_{ti}(z)$ is the union of two intervals with all candidate offsets non-negative. (b): it is a single interval with non-positive offsets. (c): at a discontinuity, if $z$ is smaller than the discontinuity, we have no direct solution for $\mathbb{I}_{ti}(z)$ for at least one positioning of the edges, but a simple trick bypasses the difficulty (see text).
  • Figure 5: Left: representation of the difference of averages in \ref{eqDIFFAVG}. Each of the secants $(\Delta_1)$ and $(\Delta_2)$ can take either the red or black segment. Which one is which depends on the signs of $c$ and $b$, but the general configuration is always the same. Note that if $F$ is convex, one necessarily sits above the other, which is the crux of the proof of Lemma \ref{lemW2bound}. For the sake of illustration, suppose we can analytically have $b,c \rightarrow 0$. As $c$ converges to 0 while $b$ remains $>0$, $\updelta_{\{b, c\}}F(a)$ becomes proportional to the variation of the average secant midpoint; letting $b$ then converge to 0 makes $\updelta_{\{b, c\}}F(a)$ converge to the second-order derivative of $F$ at $a$. Right: in the special case where $F$ is convex, one of the secants always sits above the other.
  • ...and 7 more figures
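
The convergence described in the caption of Figure 5 can be illustrated with a short numerical sketch. It rests on assumptions, since the formal definitions are not reproduced here: the $v$-derivative is taken to be the secant slope $\updelta_c F(z) = (F(z+c) - F(z))/c$, and the second-order $v$-derivative is read off the formula in the caption of Figure 2 as $\updelta_{\{b,c\}}F(a) = (\updelta_c F(a+b) - \updelta_c F(a))/b$. Function names and the test function $F(z) = z^3$ are hypothetical choices for illustration.

```python
def v_derivative(F, z, v):
    """Secant slope of F between z and z + v (assumed form of the v-derivative)."""
    return (F(z + v) - F(z)) / v


def second_order_v_derivative(F, a, b, c):
    """Assumed form of delta_{b,c} F(a): a difference of two secant slopes over b."""
    return (v_derivative(F, a + b, c) - v_derivative(F, a, c)) / b


def cubic(z):
    """Illustrative smooth function with known second derivative 6 * z."""
    return z ** 3


a = 1.0
second_derivative_at_a = 6.0 * a

# Shrink both offsets together; only function values of F are used.
steps = (0.1, 0.01, 0.001)
approximations = [second_order_v_derivative(cubic, a, h, h) for h in steps]
errors = [abs(approx - second_derivative_at_a) for approx in approximations]

for h, approx in zip(steps, approximations):
    print(f"b = c = {h:<6} delta_{{b,c}}F(a) = {approx:.6f}")
```

For this cubic, the quantity works out algebraically to $6a + 3b + 3c$, so the error shrinks linearly with the offsets.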

Theorems & Definitions (21)

  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • Lemma 4.5
  • Definition 4.6
  • Lemma 4.7
  • Lemma 5.2
  • Theorem 5.3
  • Corollary 5.6
  • ...and 11 more