Efficient line search for optimizing Area Under the ROC Curve in gradient descent

Jadon Fowler, Toby Dylan Hocking

TL;DR

This paper studies the piecewise linear/constant nature of the AUM/AUC and proposes new efficient path-following algorithms that choose the optimal learning rate for each step of gradient descent (line search) when optimizing a linear model.

Abstract

Receiver Operating Characteristic (ROC) curves are useful for evaluation in binary classification and changepoint detection, but difficult to use for learning since the Area Under the Curve (AUC) is piecewise constant (gradient zero almost everywhere). Recently the Area Under Min (AUM) of false positive and false negative rates has been proposed as a differentiable surrogate for AUC. In this paper we study the piecewise linear/constant nature of the AUM/AUC, and propose new efficient path-following algorithms for choosing the learning rate which is optimal for each step of gradient descent (line search), when optimizing a linear model. Remarkably, our proposed line search algorithm has the same log-linear asymptotic time complexity as gradient descent with constant step size, but it computes a complete representation of the AUM/AUC as a function of step size. In our empirical study of binary classification problems, we verify that our proposed algorithm is fast and exact; in changepoint detection problems we show that the proposed algorithm is just as accurate as grid search, but faster.
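To make the contrast between AUC and AUM concrete, here is a minimal Python sketch for the binary classification case. It assumes real-valued scores, 0/1 labels, and at least one example of each class; the function name `auc_and_aum` is hypothetical and is not taken from the paper's code. AUC is computed as the fraction of correctly ranked (positive, negative) pairs, and AUM as the area under min(FPR, FNR) viewed as a function of a constant added to every score.

```python
def auc_and_aum(scores, labels):
    """Return (AUC, AUM) for real-valued scores and 0/1 labels.

    Illustrative sketch only: assumes at least one positive and one
    negative example, and uses rates (FPR, FNR) rather than counts.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    n_pos, n_neg = len(pos), len(neg)

    # AUC: fraction of (positive, negative) pairs ranked correctly,
    # with ties counted as one half. Piecewise constant in the scores.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    auc = wins / (n_pos * n_neg)

    # Adding a constant c to every score changes the predicted class of
    # example i at the threshold c = -score_i.
    events = sorted((-s, y) for s, y in zip(scores, labels))
    fp, fn = 0, n_pos  # for very negative c, everything is predicted negative
    aum = 0.0
    for (t, y), (t_next, _) in zip(events, events[1:]):
        if y == 1:
            fn -= 1  # this positive is now predicted positive
        else:
            fp += 1  # this negative is now predicted positive
        # min(FPR, FNR) is constant on the interval (t, t_next).
        aum += (t_next - t) * min(fp / n_neg, fn / n_pos)
    return auc, aum


perfect = auc_and_aum([1.0, -1.0], [1, 0])   # positive ranked above negative
reversed_ = auc_and_aum([1.0, -1.0], [0, 1])  # positive ranked below negative
```

In this sketch the interval widths depend continuously on the scores, which is why AUM has a usable gradient while the pair-counting AUC does not.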

Paper Structure

This paper contains 25 sections, 2 theorems, 14 equations, 6 figures.

Key Result

Theorem 1

For data with $B$ breakpoints in label error functions, the initial AUM slope is computed via (eq:D1) in log-linear $O(B\log B)$ time. If $\beta\in\{2,\dots,B\}$ is the index of the function $T_\beta$ which is larger before an intersection at step size $\sigma_{k+1}$, then the next AUM slope $D^{k+1

Figures (6)

  • Figure 1: For four binary classification models, there is one letter, A--Z, for each ROC point (top), and corresponding interval of constants added to predictions (bottom). Number next to each ROC point shows min(FPR,FNR) (same as purple heat map values, and black curve in bottom plot), which is minimal (0) when AUC is maximal (1). The proposed algorithm is for minimizing the AUM, Area Under Min(FPR,FNR) (grey shaded region in bottom plot), which is a differentiable surrogate for the sum of min(FPR,FNR) over all points on the ROC curve (sum(min) values shown in top panel titles).
  • Figure 2: Two labeled changepoint problems (left), and corresponding error functions (right). In these two problems, the FN/FP error functions are non-monotonic, because a changepoint disappears when moving from model size 1 to 2. Vertical purple lines (right) mark predicted values which result in the line search shown in Figure 4.
  • Figure 3: Demonstration of proposed line search algorithm, for a simple binary classification problem with four data points. Top left: ROC curves at three step sizes, with shaded grey area showing parts of AUC involved in the update rules (\ref{eq:AUC_without}--\ref{eq:AUC_after}). Bottom left: error rate functions at three step sizes, with grey arrows showing the gradient, and shaded grey area (C) showing the AUM, Area Under Min(FPR,FNR). Right: AUC, AUM, and threshold functions $T_b(s)$ (black lines), as a function of step size. There is one letter for every ROC point, corresponding to an interval of constants added to predicted values at a given step size.
  • Figure 4: Demonstration of proposed line search algorithm, for the same two labeled changepoint data sequences as in Figure 2. It starts by computing AUM/AUC at step size 0 (vertical red line, iteration 1), and storing the next possible intersection points in a red-black tree (right table). Iteration 2 removes the intersection point with the smallest step size (ab), resulting in a change of AUM slope (from -2 to 0), and a change of AUC values (from 0 to 0.5 at the intersection point, then to 1 after), and no new intersection points. Three vertical grey lines represent variants with different stopping rules: first min is the smallest step size such that AUM would increase for larger step sizes, linear is the same number of iterations as the number of red lines in the threshold plot (6 lines: a--f), and quadratic means to explore all positive step sizes (shaded grey area). Note that AUC can be larger than 1 because there are cycles/loops in the ROC curve, due to non-monotonic label error functions.
  • Figure 5: Asymptotic time complexity of gradient descent with proposed line search in four binary classification data sets (CIFAR10, FashionMNIST, MNIST, STL10, first class versus others). Top: number of line search iterations per gradient descent step is $O(n^2)$ in worst case (max); exploring all intersections is $O(n^2)$ (quadratic); exploring only the first $n$ is $O(n)$ (linear); exploring until AUM increases (first min) is sub-quadratic but super-linear. Bottom: number of gradient descent steps until AUM stops decreasing (within $10^{-3}$); first min/quadratic methods (larger step sizes) take asymptotically fewer steps than linear (smaller step sizes).
  • ...and 1 more figure
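The captions above describe thresholds $T_b(s)$ that are linear in the step size $s$, with the AUM changing slope only where two threshold lines intersect. The sketch below illustrates that idea for binary classification with a deliberately naive brute-force search: it enumerates all pairwise intersections (quadratic time) and evaluates the AUM at each one. This is not the paper's log-linear algorithm, which processes intersections in order using a red-black tree; the names `aum`, `naive_line_search`, and `direction` (the per-example change in predicted score per unit step size) are hypothetical.

```python
def aum(scores, labels):
    """Area under min(FPR, FNR) over constants added to the scores.

    Assumes 0/1 labels with at least one example of each class.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    events = sorted((-s, y) for s, y in zip(scores, labels))
    fp, fn, total = 0, n_pos, 0.0
    for (t, y), (t_next, _) in zip(events, events[1:]):
        if y == 1:
            fn -= 1
        else:
            fp += 1
        total += (t_next - t) * min(fp / n_neg, fn / n_pos)
    return total


def naive_line_search(scores, direction, labels):
    """Exact minimizer of AUM(s) over s >= 0, with scores(s) = scores - s * direction.

    AUM(s) is continuous and linear between intersections of the threshold
    lines T_i(s) = -scores[i] + s * direction[i], and is bounded below by 0,
    so a minimum is attained at s = 0 or at some positive intersection.
    """
    n = len(scores)
    candidates = {0.0}
    for i in range(n):
        for j in range(i + 1, n):
            slope_diff = direction[i] - direction[j]
            if slope_diff != 0:
                s = (scores[i] - scores[j]) / slope_diff  # where T_i(s) = T_j(s)
                if s > 0:
                    candidates.add(s)

    def aum_at(s):
        return aum([f - s * d for f, d in zip(scores, direction)], labels)

    # Break ties in AUM by preferring the smallest step size.
    best_step = min(candidates, key=lambda s: (aum_at(s), s))
    return best_step, aum_at(best_step)


# The positive example starts ranked below the negative one; stepping along
# this direction raises the positive score and lowers the negative score.
best_step, best_aum = naive_line_search([-1.0, 1.0], [-1.0, 1.0], [1, 0])
```

In this toy example the two threshold lines cross at step size 1, where the AUM drops from 2 to 0. The "first min", "linear", and "quadratic" variants in Figure 4 differ only in how many of these intersections they visit before stopping.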

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof