Soft ascent-descent as a stable and flexible alternative to flooding

Matthew J. Holland, Kosuke Nakatani

TL;DR

A softened, pointwise mechanism called SoftAD (soft ascent-descent) is proposed that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead.

Abstract

As a heuristic for improving test accuracy in classification, the "flooding" method proposed by Ishida et al. (2020) sets a threshold for the average surrogate loss at training time; above the threshold, gradient descent is run as usual, but below the threshold, a switch to gradient ascent is made. While setting the threshold is non-trivial and is usually done with validation data, this simple technique has proved remarkably effective in terms of accuracy. On the other hand, what if we are also interested in other metrics such as model complexity or average surrogate loss at test time? As an attempt to achieve better overall performance with less fine-tuning, we propose a softened, pointwise mechanism called SoftAD (soft ascent-descent) that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead. We contrast formal stationarity guarantees with those for flooding, and empirically demonstrate how SoftAD can realize classification accuracy competitive with flooding (and the more expensive alternative SAM) while enjoying a much smaller loss generalization gap and model norm.
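
To make the contrast concrete, here is a minimal sketch of the two update directions described in the abstract: flooding applies one hard sign switch to the average gradient, while SoftAD applies a soft, bounded weight to each point before averaging. The helper names (`rho_prime`, `mean_vector`, `flood_direction`, `softad_direction`) are hypothetical. The soft sign is assumed to be the derivative of $\rho(u) = \sqrt{u^{2}+1} - 1$, and the scale parameter `sigma` follows the notation of the Figure 1 caption; neither choice is confirmed by this summary.

```python
import math


def rho_prime(u):
    # Assumed soft sign: derivative of rho(u) = sqrt(u^2 + 1) - 1.
    # It is 0 at u = 0 and its absolute value stays below 1.
    return u / math.sqrt(u * u + 1.0)


def mean_vector(vectors):
    # Coordinate-wise average of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def flood_direction(losses, grads, theta):
    # Flooding: compare the AVERAGE loss with the threshold theta.
    # Above theta the average gradient is kept (descent); below theta
    # its sign is flipped (ascent). Exactly at theta we return zero,
    # which is a convention chosen for this sketch.
    avg_loss = sum(losses) / len(losses)
    sign = (avg_loss > theta) - (avg_loss < theta)
    return [sign * g for g in mean_vector(grads)]


def softad_direction(losses, grads, theta, sigma=1.0):
    # SoftAD (as sketched here): weight EACH per-point gradient by a soft
    # sign of its own scaled loss deviation, then average.
    weighted = [
        [rho_prime((loss - theta) / sigma) * g for g in grad]
        for loss, grad in zip(losses, grads)
    ]
    return mean_vector(weighted)


# Toy batch: one point below the threshold, one on it, one outlier.
losses = [0.1, 0.5, 4.0]
grads = [[1.0], [1.0], [1.0]]
flood = flood_direction(losses, grads, theta=0.5)
soft = softad_direction(losses, grads, theta=0.5)
```

In this toy batch the point sitting exactly on the threshold receives weight zero, the outlier's weight stays below one, and the point below the threshold contributes an ascent term. A parameter update would then subtract a step size times the returned direction.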

Paper Structure

This paper contains 41 sections, 3 theorems, 46 equations, 12 figures, and 4 tables.

Key Result

Proposition 3

Starting with an arbitrary $w_{1} \in \mathcal{W}$, update using $w_{t+1} = w_{t} - \alpha \mathsf{M}_{t}/\lVert \mathsf{M}_{t} \rVert$, where $\mathsf{M}_{t} := b\mathsf{M}_{t-1} + (1-b)\overline{\mathsf{G}}$ [...] using confidence-dependent factor $C_{\delta} :=$ [...]
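
The update rule in this proposition is a normalized momentum step, which can be sketched as below. This is only an illustration of the recursion as far as it is visible here. The function name `normalized_momentum_steps` and the callable `grad_estimate` are hypothetical, the gradient estimate is left abstract because its definition is cut off in the statement, and the initialization $\mathsf{M}_{0} = 0$ is an assumption.

```python
import math


def normalized_momentum_steps(w1, grad_estimate, alpha, b, num_steps):
    # w1: starting point (list of floats).
    # grad_estimate(w, t): hypothetical callable returning the gradient
    #   estimate used at step t; its exact form is not given in the summary.
    # alpha: step size; b: momentum weight in [0, 1).
    w = list(w1)
    m = [0.0] * len(w)  # assumption: M_0 = 0
    for t in range(1, num_steps + 1):
        g = grad_estimate(w, t)
        # M_t = b * M_{t-1} + (1 - b) * G_t
        m = [b * mi + (1.0 - b) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm == 0.0:
            break  # direction undefined; stop rather than divide by zero
        # w_{t+1} = w_t - alpha * M_t / ||M_t||
        w = [wi - alpha * mi / norm for wi, mi in zip(w, m)]
    return w


# Usage with a stand-in gradient: G_t = w, the gradient of ||w||^2 / 2.
w_next = normalized_momentum_steps(
    [3.0, 4.0], lambda w, t: list(w), alpha=0.5, b=0.9, num_steps=1
)
```

Because the momentum vector is normalized, every step has length exactly `alpha` regardless of the gradient scale; only the direction depends on the gradient estimates.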

Figures (12)

  • Figure 1: The left-most figure simply plots the graph of $f(x) = x^{2}/2$ over $x \in [-2,2]$. The two remaining figures show plots of the graphs of $f^{\prime}(x) = x$ (dashed black line) and $\phi((f(x)-\theta)/\sigma)f^{\prime}(x)$ for the same range of $x$ values, with colors corresponding to modified values of $\sigma$ (middle plot; $\theta = 0.5$ fixed) and $\theta$ (right-most plot; $\sigma = 1.0$ fixed) respectively. Thick dotted lines are $\phi=\mathop{\mathrm{sign}}\nolimits$, thin solid lines are $\phi=\rho^{\prime}$.
  • Figure 2: Gradient descent on the quadratic example from Figure 1. The horizontal axis denotes iteration number, and we plot sequences of iterates $(x_{t})$ and function values $(f(x_{t}))$ for each method. Here "GD" denotes vanilla gradient descent, with "Flood" and "SoftAD" corresponding to (\ref{eqn:flood_batch}) and (\ref{eqn:softAD_update}) respectively. Step size is $\alpha = 0.1$.
  • Figure 3: Left: We randomly sample $n=8$ points (black dots) from a 2D Gaussian distribution with zero mean, zero correlation, and standard deviation $2\sqrt{2}$ in each coordinate. The two candidates are denoted by square-shaped points (red and green), and the minimizer of $\mathsf{R}_{n}$ is given by a gold star. Center: The Flooding updates (colored arrows) via (\ref{eqn:flood_batch}) for each candidate. Right: Analogous SoftAD update vectors via (\ref{eqn:softAD_update}), with per-point transformed gradients (semi-transparent arrows) for reference. Throughout, we have fixed $\theta = 1.5 \times \min_{w}\mathsf{R}_{n}(w)$ and $\alpha = 0.75$.
  • Figure 4: Trajectories over epochs for average test loss (top row) and test accuracy (bottom row). Horizontal axis is epoch number. Columns are associated with the CIFAR-10 and CIFAR-100 datasets (left to right).
  • Figure 5: Analogous to Figure 4, but with FashionMNIST and SVHN datasets.
  • ...and 7 more figures
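
The one-dimensional quadratic example described in the captions of Figures 1 and 2 can be reproduced roughly as follows. This is a sketch under stated assumptions, not the authors' experiment code: the step size $\alpha = 0.1$ and $f(x) = x^{2}/2$ come from the captions, while the starting point `x0 = 2.0`, the values `theta = 0.5` and `sigma = 1.0` for this run, and the form $\rho^{\prime}(u) = u/\sqrt{u^{2}+1}$ are assumptions. All function names are hypothetical.

```python
import math


def f(x):
    return 0.5 * x * x


def f_prime(x):
    return x


def sign(u):
    # Hard switch used by flooding; returns 0 exactly at the threshold.
    return (u > 0) - (u < 0)


def rho_prime(u):
    # Assumed soft switch: derivative of rho(u) = sqrt(u^2 + 1) - 1.
    return u / math.sqrt(u * u + 1.0)


def run(phi, x0=2.0, theta=0.5, sigma=1.0, alpha=0.1, steps=50):
    # Iterates x_{t+1} = x_t - alpha * phi((f(x_t) - theta) / sigma) * f'(x_t),
    # following the transformed-gradient expression in the Figure 1 caption.
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - alpha * phi((f(x) - theta) / sigma) * f_prime(x))
    return xs


gd = run(lambda u: 1.0)   # vanilla gradient descent: phi is constant 1
flood = run(sign)         # hard ascent-descent switch
softad = run(rho_prime)   # soft ascent-descent switch

for name, xs in [("GD", gd), ("Flood", flood), ("SoftAD", softad)]:
    print(f"{name:7s} x_T = {xs[-1]:.4f}  f(x_T) = {f(xs[-1]):.4f}")
```

With these settings, plain gradient descent drives $x_{t}$ toward 0. The flooding iterates keep alternating between ascent and descent steps around the level set $f(x) = \theta$, and the SoftAD iterates approach that level set monotonically, since the soft weight shrinks to zero as $f(x_{t})$ nears $\theta$.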

Theorems & Definitions (10)

  • Remark 1: Comparison with other variants of Flooding
  • Remark 2: Difference from OCE-like criteria
  • Proposition 3: Stationarity for SoftAD, smooth case
  • Proposition 4: Stationarity for Flooding, smooth case
  • Remark 5: Comparing rates and assumptions
  • Proposition 6: Stationarity, non-smooth case
  • Remark 7: Stationarity in the original objective
  • Proof of Proposition 3
  • Proof of Proposition 4
  • Proof of Proposition 6