How to Boost Any Loss Function

Richard Nock; Yishay Mansour

TL;DR

Any loss function can be optimized with boosting and thus boosting can achieve a feat not yet known to be possible in the classical $0^{th}$ order setting, since loss functions are not required to be convex, nor differentiable or Lipschitz -- and in fact not required to be continuous either.

Abstract

Boosting is a highly successful ML-born optimization setting in which one is required to computationally efficiently learn arbitrarily good models based on access to a weak learner oracle, providing classifiers performing at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not require access to first order information about a loss, yet the decades-long history of boosting has quickly evolved it into a first order optimization setting -- sometimes even wrongfully defining it as such. Owing to recent progress extending gradient-based optimization to use only a loss' zeroth ($0^{th}$) order information to learn, this raises the question: what loss functions can be efficiently optimized with boosting and what is the information really needed for boosting to meet the original boosting blueprint's requirements? We provide a constructive formal answer essentially showing that any loss function can be optimized with boosting and thus boosting can achieve a feat not yet known to be possible in the classical $0^{th}$ order setting, since loss functions are not required to be convex, nor differentiable or Lipschitz -- and in fact not required to be continuous either. Some tools we use are rooted in quantum calculus, the mathematical field -- not to be confounded with quantum computation -- that studies calculus without passing to the limit, and thus without using first order information.

Paper Structure (31 sections, 12 theorems, 67 equations, 12 figures, 1 table, 3 algorithms)

This paper contains 31 sections, 12 theorems, 67 equations, 12 figures, 1 table, 3 algorithms.

Introduction
Related work
Definitions and notations
$v$-derivatives and Bregman secant distortions
Boosting using only queries on the loss
Algorithm: SecBoost
General steps
The offset oracle, oo
Convergence of SecBoost
Boosting-compliant convergence
Finding $\overline{W}_{2,t}$
Implementation of the offset oracle
Discussion
Conclusion
A quick summary of recent zeroth-order optimization approaches
...and 16 more sections

Key Result

Lemma 4.5

Suppose $F$ is strictly convex and differentiable. Then $\lim_{v\rightarrow 0} S_{F|v}(z'\|z) = D_F(z'\| F'(z))$.
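
The limit in this lemma can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated on this page: it takes the $v$-derivative to be the secant slope $\updelta_v F(z) = (F(z+v) - F(z))/v$ (the usual $h$-derivative of quantum calculus), the Bregman Secant distortion to be $S_{F|v}(z'\|z) = F(z') - F(z) - (z'-z)\,\updelta_v F(z)$, and the right-hand side of the lemma to denote the standard Bregman divergence $F(z') - F(z) - (z'-z)F'(z)$. The function names are hypothetical.

```python
import math


def v_derivative(F, z, v):
    """Secant slope of F between z and z + v (assumed definition; v must be nonzero)."""
    if v == 0:
        raise ValueError("offset v must be nonzero")
    return (F(z + v) - F(z)) / v


def bregman_secant_distortion(F, z_prime, z, v):
    """Assumed form: F(z') - F(z) - (z' - z) * v-derivative of F at z."""
    return F(z_prime) - F(z) - (z_prime - z) * v_derivative(F, z, v)


def bregman_divergence(F, dF, z_prime, z):
    """Standard Bregman divergence, using the true derivative dF."""
    return F(z_prime) - F(z) - (z_prime - z) * dF(z)


if __name__ == "__main__":
    F = math.exp  # strictly convex and differentiable; its derivative is itself
    z, z_prime = 0.0, 1.0
    for v in (1.0, 0.1, 0.001, 1e-6):
        s = bregman_secant_distortion(F, z_prime, z, v)
        print(f"v = {v:g}: distortion = {s:.6f}")
    print(f"Bregman divergence = {bregman_divergence(F, math.exp, z_prime, z):.6f}")
```

Under these assumptions the distortion vanishes when $z' = z + v$, is negative for $z'$ strictly between $z$ and $z+v$, and is positive outside that interval when $F$ is convex, which matches the sign pattern described for Figure 1.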

Figures (12)

Figure 1: Left: value of $S_{F|v}( z'\|z)$ for convex $F$, $v \stackrel{\mathrm{.}}{=} z_4 - z$ and various $z'$ (colors), for which the Bregman Secant distortion is positive ($z'=z_1$, green), negative ($z'=z_2$, red), minimal ($z'=z_3$) or null ($z'=z_4, z$). Right: depiction of $Q_{F}(z,z+v, z')$ for non-convex $F$ (Definition \ref{defBI}).
Figure 2: Simplified depiction of $\overline{W}_{2,t}$ "regimes" (Assumption \ref{assum2cr}). We only plot the components of the $v$-derivative part in \ref{boundW2}: removing index $i$ for readability, we get $\updelta_{\{e_{t},v_{t-1}\}}F(\tilde{e}_{t-1}) = (B_t - A_t) / (y H_t(\bm{x})-y H_{t-1}(\bm{x}))$ with $A_t \stackrel{\mathrm{.}}{=} \updelta_{v_{t-1}}F(y H_{t-1}(\bm{x})) = - w_t$ and $B_t \stackrel{\mathrm{.}}{=} \updelta_{v_{t-1}}F(y H_{t}(\bm{x}))$ ($= - w_{t+1}$ iff $v_{t-1} = v_{t}$). If the loss is "nice" like the exponential or logistic losses, we always have a small $\overline{W}_{2,t}$ (a). Place a bump in the loss (b-d) and there is a risk that $\overline{W}_{2,t}$ is too large for the WRA to hold. Workarounds include two strategies: picking small enough offsets (b) or fitting offsets large enough to pass the bump (c). The blue arrow in (d) is discussed in Section \ref{sec:disc}.
Figure 3: A simple way to build $\mathbb{I}_{ti}(z)$ for a discontinuous loss $F$ ($\tilde{e}_{ti}<\tilde{e}_{(t-1)i}$ and $z$ are represented), $\mathcal{O}$ being the set of solutions as it is built. We rotate two half-lines, one passing through $(\tilde{e}_{ti},F(\tilde{e}_{ti}))$ (thick line, $(\Delta)$) and a parallel one translated by $-z$ (dashed line) (a). As soon as $(\Delta)$ crosses $F$ on any point $(z',F(z'))$ with $z\neq \tilde{e}_{ti}$ while the dashed line stays below $F$, we obtain a candidate offset $v$ for oo, namely $v = z' - \tilde{e}_{ti}$. In (b), we obtain an interval of values. We keep on rotating $(\Delta)$, eventually making several intervals appear for the choice of $v$ if $F$ is not convex (c). Finally, when we reach an angle such that the maximal difference between $(\Delta)$ and $F$ in $[\tilde{e}_{ti},\tilde{e}_{(t-1)i}]$ is $z$ ($z$ can be located at an intersection between $F$ and the dashed line), we stop and obtain the full $\mathbb{I}_{ti}(z)$ (d).
Figure 4: More examples of ensembles $\mathbb{I}_{ti}(z)$ (in blue) for the $F$ in Figure \ref{f-Iti-const}. (a): $\mathbb{I}_{ti}(z)$ is the union of two intervals with all candidate offsets non-negative. (b): it is a single interval with non-positive offsets. (c): at a discontinuity, if $z$ is smaller than the discontinuity, we have no direct solution for $\mathbb{I}_{ti}(z)$ for at least one positioning of the edges, but a simple trick bypasses the difficulty (see text).
Figure 5: Left: representation of the difference of averages in \ref{eqDIFFAVG}. Each of the secants $(\Delta_1)$ and $(\Delta_2)$ can take either the red or black segment. Which one is which depends on the signs of $c$ and $b$, but the general configuration is always the same. Note that if $F$ is convex, one necessarily sits above the other, which is the crux of the proof of Lemma \ref{lemW2bound}. For the sake of illustration, suppose we can analytically have $b,c \rightarrow 0$. As $c$ converges to 0 but $b$ remains $>0$, $\updelta_{\{b, c\}}F(a)$ becomes proportional to the variation of the average secant midpoint; the subsequent convergence of $b$ to 0 then makes $\updelta_{\{b, c\}}F(a)$ converge to the second-order derivative of $F$ at $a$. Right: in the special case where $F$ is convex, one of the secants always sits above the other.
...and 7 more figures
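
The second-order $v$-derivative that appears in the captions of Figures 2 and 5 can be sketched in a few lines. This is a hedged illustration, not the paper's code: it assumes $\updelta_c F(z) = (F(z+c) - F(z))/c$ and $\updelta_{\{b,c\}}F(a) = (\updelta_c F(a+b) - \updelta_c F(a))/b$, which agrees with the $(B_t - A_t)/(y H_t(\bm{x}) - y H_{t-1}(\bm{x}))$ expression in the Figure 2 caption. The function names and the two test losses are hypothetical choices made for this example.

```python
import math


def v_derivative(F, z, v):
    """Secant slope of F between z and z + v (assumed definition, v != 0)."""
    return (F(z + v) - F(z)) / v


def second_order_v_derivative(F, a, b, c):
    """Assumed form: difference of two c-secant slopes, taken b apart, divided by b."""
    return (v_derivative(F, a + b, c) - v_derivative(F, a, c)) / b


def logistic_loss(z):
    """Smooth, convex loss; its second derivative at 0 equals 1/4."""
    return math.log1p(math.exp(-z))


def step_loss(z):
    """0/1 loss: discontinuous at 0, so it has no derivative there."""
    return 1.0 if z < 0 else 0.0


if __name__ == "__main__":
    # With small offsets, the quantity approaches the second derivative (0.25 here).
    print(second_order_v_derivative(logistic_loss, 0.0, 1e-3, 1e-3))
    # With finite offsets, it remains well defined across a discontinuity.
    print(second_order_v_derivative(step_loss, -0.5, 1.0, 1.0))
```

Only evaluations of the loss are used, so the same function applies unchanged to the smooth and the discontinuous example.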

Theorems & Definitions (21)

Definition 4.1
Definition 4.2
Definition 4.3
Definition 4.4
Lemma 4.5
Definition 4.6
Lemma 4.7
Lemma 5.2
Theorem 5.3
Corollary 5.6
...and 11 more
