Switch EMA: A Free Lunch for Better Flatness and Sharpness

Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, Stan Z. Li

TL;DR

Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training: it improves performance and speeds up convergence.

Abstract

Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization that learns flat optima for better generalization at no extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods might end up with worse final performance or require extra test-time computation. This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA helps DNNs reach generalization optima that better trade off flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments on discriminative, generative, and regression tasks with vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training, improving performance and boosting convergence speed.
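
The "single line of modification" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy noisy quadratic objective, the function names `ema_update` and `train`, and all hyperparameter values are assumptions chosen only to show where the switch happens relative to the usual EMA update, written here with momentum $\alpha$ as $\theta_\text{EMA} \leftarrow \alpha\,\theta_\text{EMA} + (1-\alpha)\,\theta$.

```python
import random


def ema_update(ema_params, params, alpha):
    """In-place EMA step: ema <- alpha * ema + (1 - alpha) * params."""
    for i, p in enumerate(params):
        ema_params[i] = alpha * ema_params[i] + (1.0 - alpha) * p


def train(switch, epochs=20, steps_per_epoch=50, lr=0.1, alpha=0.9, seed=0):
    """Run SGD on a toy noisy quadratic 0.5 * ||w - target||^2.

    switch=False keeps a plain EMA copy of the weights.
    switch=True additionally copies the EMA weights back into the model
    after each epoch (the SEMA-style switch described in the abstract).
    All values here are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    target = [1.0, -2.0]          # hypothetical optimum of the toy objective
    params = [0.0, 0.0]           # model weights updated by SGD
    ema_params = list(params)     # EMA copy of the weights

    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            # Noisy gradient of the quadratic: (w - target) + Gaussian noise.
            grads = [(p - t) + rng.gauss(0.0, 0.5)
                     for p, t in zip(params, target)]
            params = [p - lr * g for p, g in zip(params, grads)]
            ema_update(ema_params, params, alpha)
        if switch:
            # The switch: the next epoch starts from the EMA weights.
            params = list(ema_params)
    return params, ema_params


if __name__ == "__main__":
    sema_model, sema_ema = train(switch=True)
    ema_model, ema_ema = train(switch=False)
    print("SEMA model / EMA weights:", sema_model, sema_ema)
    print("EMA  model / EMA weights:", ema_model, ema_ema)
```

With `switch=True`, the model and its EMA copy coincide at every epoch boundary, whereas with `switch=False` they stay separate. In a real training loop the same copy would apply to every parameter tensor of the network. Whether optimizer state or other buffers should also be handled at the switch is not stated in this summary and is left open here.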

Paper Structure

This paper contains 50 sections, 3 theorems, 32 equations, 7 figures, 19 tables, 1 algorithm.

Key Result

Proposition 1

(Low-frequency Oscillation): In the noisy quadratic model, the variances of the SGD, EMA, and SEMA iterates, denoted as $V_\text{SGD}$, $V_\text{EMA}$, and $V_\text{SEMA}$, converge to the following values according to Banach's fixed-point theorem, and $V_\text{SEMA}<V_\text{EMA}<V_\text{SGD}$:

Figures (7)

  • Figure 1: Training epoch vs. performance plots of the baseline, EMA, and SEMA. (a) Image classification with DeiT-S on ImageNet-1K (IN-1K); (b) Object detection and segmentation with ResNet-50 Cascade (C.) Mask R-CNN (3$\times$) on COCO; (c) Contrastive learning (CL) pre-training with MoCo.V3 and DeiT-S on CIFAR-100; (d) Face age regression with ResNet-50 on AgeDB. SEMA shows faster convergence speeds and better performances than EMA and the baselines.
  • Figure 2: 1D loss landscape with validation loss and top-1 accuracy of classification on (a)-(f) CIFAR-100 and (g)(h) ImageNet-1K. The loss landscapes of EMA and SWA models are flatter than those of the baseline (using vanilla optimizers), while our proposed SEMA yields deeper and smoother local minima, with slopes as flat as those of EMA. Note that the performance gaps are relatively small on ImageNet-1K due to the massive training data.
  • Figure 3: Illustration of 2D loss landscape and optimization trajectory on Circles test set. EMA models reach the flat basin while the baseline is stuck at the sharp cliff. Projecting the EMA model to the landscape of SEMA, the SEMA model approaches the local minima efficiently.
  • Figure 4: Illustration of the baseline, EMA, and SEMA on the Circles dataset with 50 labeled samples (red/yellow triangles) and 500 testing samples (gray points) when training a 2-layer MLP. We plot the decision boundary, accuracy, decision boundary width, and prediction calibration.
  • Figure 5: Ablation of the momentum $\alpha$ in EMA and SEMA, searching in the ranges of 0.9$\sim$0.9999 and 0.99$\sim$0.99999. SEMA is robust to the choice of $\alpha$ across most tasks.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3