Soft ascent-descent as a stable and flexible alternative to flooding

Matthew J. Holland; Kosuke Nakatani

TL;DR

The paper proposes SoftAD (soft ascent-descent), a softened, pointwise mechanism that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead.

Abstract

As a heuristic for improving test accuracy in classification, the "flooding" method proposed by Ishida et al. (2020) sets a threshold for the average surrogate loss at training time; above the threshold, gradient descent is run as usual, but below the threshold, a switch to gradient ascent is made. While setting the threshold is non-trivial and is usually done with validation data, this simple technique has proved remarkably effective in terms of accuracy. On the other hand, what if we are also interested in other metrics such as model complexity or average surrogate loss at test time? As an attempt to achieve better overall performance with less fine-tuning, we propose a softened, pointwise mechanism called SoftAD (soft ascent-descent) that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead. We contrast formal stationarity guarantees with those for flooding, and empirically demonstrate how SoftAD can realize classification accuracy competitive with flooding (and the more expensive alternative SAM) while enjoying a much smaller loss generalization gap and model norm.
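
To make the contrast in the abstract concrete, the following is a minimal sketch of the two gradient modifications, not the authors' implementation. Flooding applies a single sign, determined by the average loss, to the whole mean gradient; the SoftAD-style variant weights each point's gradient separately by a soft, bounded function of that point's loss. The function names, the toy numbers, and the specific soft weight `x / sqrt(x^2 + 1)` are assumptions made for illustration; this excerpt does not define the function $\rho$ used in the paper.

```python
import numpy as np

def soft_sign(x):
    """Assumed soft weight x / sqrt(x^2 + 1): a smooth, bounded stand-in for sign(x)."""
    return x / np.sqrt(x * x + 1.0)

def flooding_grad(losses, grads, theta):
    """Flooding: one sign, set by the average loss, multiplies the mean gradient."""
    return np.sign(losses.mean() - theta) * grads.mean(axis=0)

def softad_grad(losses, grads, theta, sigma=1.0):
    """SoftAD-style: every point gets its own soft weight in (-1, 1)."""
    weights = soft_sign((losses - theta) / sigma)
    return (weights[:, None] * grads).mean(axis=0)

# Toy per-example losses and gradients (made-up values for illustration only).
losses = np.array([0.1, 0.5, 4.0])
grads = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
theta = 0.5  # threshold; in practice chosen e.g. with validation data

g_flood = flooding_grad(losses, grads, theta)
g_soft = softad_grad(losses, grads, theta)
print("flooding:", g_flood)
print("soft    :", g_soft)
```

In this toy batch, the point whose loss equals the threshold receives weight zero (a borderline point is downweighted), the point below the threshold receives a negative weight (ascent), and the high-loss point's weight stays below one in magnitude (its influence is bounded). Both functions need only per-example losses and gradients, so the soft variant adds no extra forward or backward passes in this sketch.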

Paper Structure (41 sections, 3 theorems, 46 equations, 12 figures, 4 tables)

Introduction
Background
Flooding
Links to sharpness
Soft ascent-descent
Initial comparison with Flooding
Comparison of convergence properties
Empirical study
Overview of experiments
Non-linear binary classification on the plane
Image classification from scratch
Main findings
Uniformly small loss generalization gap
Balance of accuracy and loss on real data
Uniformly smaller model norms
...and 26 more sections

Key Result

Proposition 3

Starting with an arbitrary $w_{1} \in \mathcal{W}$, update using $w_{t+1} = w_{t} - \alpha \mathsf{M}_{t}/\lVert \mathsf{M}_{t} \rVert$, where $\mathsf{M}_{t} := b\mathsf{M}_{t-1} + (1-b)\overline{\mathsf{G}}$ […] using confidence-dependent factor $C_{\delta} :=$ […]
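
The update rule stated in Proposition 3 is a normalized step along a momentum average of gradient estimates. The sketch below shows one such step under stated assumptions: the definition of the gradient estimate is truncated in this excerpt, so the function accepts any estimate `g`; the default values of `alpha` and `b`, the small `eps` guard against division by zero, and the quadratic test objective are all hypothetical choices made for illustration.

```python
import numpy as np

def normalized_momentum_step(w, m_prev, g, alpha=0.01, b=0.9, eps=1e-12):
    """One step of w_next = w - alpha * M / ||M||, with M = b * m_prev + (1 - b) * g.

    `g` is any gradient estimate at `w`; `eps` is an added numerical guard
    that is not part of the stated update.
    """
    m = b * m_prev + (1.0 - b) * g
    w_next = w - alpha * m / (np.linalg.norm(m) + eps)
    return w_next, m

# Illustration on f(w) = ||w||^2 / 2, whose gradient at w is w itself.
w0 = np.array([3.0, -4.0])
m0 = np.zeros_like(w0)
w1, m1 = normalized_momentum_step(w0, m0, g=w0)
print(w1, np.linalg.norm(w1 - w0))
```

Because the step is normalized, its length is (up to the `eps` guard) equal to `alpha` regardless of the gradient's magnitude; only the direction depends on the momentum average.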

Figures (12)

Figure 1: The left-most figure simply plots the graph of $f(x) = x^{2}/2$ over $x \in [-2,2]$. The two remaining figures show plots of the graphs of $f^{\prime}(x) = x$ (dashed black line) and $\phi((f(x)-\theta)/\sigma)f^{\prime}(x)$ for the same range of $x$ values, with colors corresponding to modified values of $\sigma$ (middle plot; $\theta = 0.5$ fixed) and $\theta$ (right-most plot; $\sigma = 1.0$ fixed) respectively. Thick dotted lines are $\phi=\operatorname{sign}$, thin solid lines are $\phi=\rho^{\prime}$.
Figure 2: Gradient descent on the quadratic example from Figure 1. The horizontal axis denotes iteration number, and we plot sequences of iterates $(x_{t})$ and function values $(f(x_{t}))$ for each method. Here "GD" denotes vanilla gradient descent, with "Flood" and "SoftAD" corresponding to (\ref{eqn:flood_batch}) and (\ref{eqn:softAD_update}) respectively. Step size is $\alpha = 0.1$.
Figure 3: Left: We randomly sample $n=8$ points (black dots) from the 2D Gaussian distribution, zero mean, zero correlations, with standard deviation $2\sqrt{2}$ in each coordinate. The two candidates are denoted by square-shaped points (red and green), and the minimizer of $\mathsf{R}_{n}$ is given by a gold star. Center: The Flooding updates (colored arrows) via (\ref{eqn:flood_batch}) for each candidate. Right: Analogous SoftAD update vectors via (\ref{eqn:softAD_update}), with per-point transformed gradients (semi-transparent arrows) for reference. Throughout, we have fixed $\theta = 1.5 \times \min_{w}\mathsf{R}_{n}(w)$ and $\alpha = 0.75$.
Figure 4: Trajectories over epochs for average test loss (top row) and test accuracy (bottom row). Horizontal axis is epoch number. Columns are associated with the CIFAR-10 and CIFAR-100 datasets (left to right).
Figure 5: Analogous to Figure 4, but with FashionMNIST and SVHN datasets.
...and 7 more figures
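
The quadratic example described in the captions of Figures 1 and 2 can be sketched as a one-dimensional iteration $x_{t+1} = x_{t} - \alpha\,\phi((f(x_{t})-\theta)/\sigma)\,f^{\prime}(x_{t})$ with $f(x) = x^{2}/2$. The values $\theta = 0.5$, $\sigma = 1.0$, and $\alpha = 0.1$ are taken from the captions. The starting point `x0`, the number of steps, and the concrete form of `rho_prime` are assumptions for illustration, since $\rho$ is not defined in this excerpt; this is a sketch, not a reproduction of the figures.

```python
import math

def f(x):
    return 0.5 * x * x

def f_prime(x):
    return x

def sign(u):
    return (u > 0) - (u < 0)

def rho_prime(u):
    # Assumed soft weight; a smooth, bounded stand-in for sign(u).
    return u / math.sqrt(u * u + 1.0)

def run(phi, x0=2.0, theta=0.5, sigma=1.0, alpha=0.1, steps=200):
    """Iterate x <- x - alpha * phi((f(x) - theta) / sigma) * f'(x)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - alpha * phi((f(x) - theta) / sigma) * f_prime(x))
    return xs

gd = run(lambda u: 1.0)   # vanilla gradient descent: weight is always 1
flood = run(sign)         # hard switch between descent and ascent
soft = run(rho_prime)     # soft weight that vanishes as f(x) nears theta

for name, xs in [("GD", gd), ("Flood", flood), ("SoftAD", soft)]:
    print(f"{name:7s} final x = {xs[-1]:.6f}, final f(x) = {f(xs[-1]):.6f}")
```

In this sketch the vanilla iterates shrink toward zero, the sign-based iterates keep alternating between descent and ascent around the level $f(x) = \theta$, and the soft-weighted iterates settle at that level because the weight shrinks to zero as $f(x)$ approaches $\theta$.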

Theorems & Definitions (10)

Remark 1: Comparison with other variants of Flooding
Remark 2: Difference from OCE-like criteria
Proposition 3: Stationarity for SoftAD, smooth case
Proposition 4: Stationarity for Flooding, smooth case
Remark 5: Comparing rates and assumptions
Proposition 6: Stationarity, non-smooth case
Remark 7: Stationarity in the original objective
Proof of Proposition 3
Proof of Proposition 4
Proof of Proposition 6
