When, Where and Why to Average Weights?

Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping

TL;DR

This paper evaluates two weight averaging (WA) techniques, Latest Weight Averaging (LAWA) and Exponential Moving Averaging (EMA), across the diverse workloads of the AlgoPerf benchmark to determine their effects on training speed and generalization. It demonstrates that WA can significantly reduce training time (e.g., a $12\%$ reduction in GPU-hours) and provide modest generalization gains, and it shows that WA often acts as a proxy for a shorter learning-rate decay. However, WA cannot fully replace learning-rate schedules across all workloads; combining WA with LR annealing typically yields the best results. The findings advocate using WA as a practical, low-cost tool to accelerate training while highlighting its role within a broader optimization strategy. The study also extends WA to higher-order optimizers like Distributed Shampoo and provides insights into hyperparameter robustness and horizon selection for real-world deployment.

Abstract

Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf \citep{dahl_benchmarking_2023}, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve the best performance.
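To make the two averaging schemes concrete, here is a minimal Python sketch, not the paper's implementation. It assumes weights are flat lists of floats, that EMA blends each new iterate into a running average with a fixed decay, and that LAWA takes a uniform mean over the most recent checkpoints in a sliding window. The class names, the `decay` and `window` values, and the toy trajectory are all hypothetical choices made for illustration.

```python
from collections import deque


class EMA:
    """Exponential moving average of weights (illustrative sketch)."""

    def __init__(self, weights, decay=0.99):
        self.decay = decay          # assumed hyperparameter, not from the paper
        self.avg = list(weights)    # running average, initialised from the given weights

    def update(self, weights):
        d = self.decay
        # avg <- d * avg + (1 - d) * current weights
        self.avg = [d * a + (1 - d) * w for a, w in zip(self.avg, weights)]


class LAWA:
    """Uniform average over the latest `window` checkpoints (illustrative sketch)."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest checkpoint automatically
        self.buffer = deque(maxlen=window)

    def update(self, weights):
        self.buffer.append(list(weights))

    def average(self):
        n = len(self.buffer)
        # element-wise mean across the stored checkpoints
        return [sum(ws) / n for ws in zip(*self.buffer)]


# Toy trajectory standing in for optimizer iterates: w_t = [t, 2t], t = 1..4.
trajectory = [[float(t), 2.0 * t] for t in range(1, 5)]

ema = EMA([0.0, 0.0], decay=0.5)
lawa = LAWA(window=2)
for w in trajectory:
    ema.update(w)
    lawa.update(w)

ema_weights = ema.avg
lawa_weights = lawa.average()
```

In this sketch the averaged weights are kept separately from the weights being optimized, so averaging costs extra memory (one copy for EMA, `window` copies for LAWA) but leaves the training trajectory itself unchanged.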

Paper Structure

This paper contains 36 sections, 39 figures, 4 tables, 2 algorithms.

Figures (39)

  • Figure 1: OGBG
  • Figure 2: WMT
  • Figure 3: Librispeech Conformer
  • Figure 4: Imagenet ViT
  • Figure 6: LAWA and EMA speed up training across several architectures and datasets. Both averaging schemes consistently outperform the baseline, achieving on average the benchmark target score using 82% of the steps required by NadamW. We estimate a 12% reduction in GPU-hours to train the entire AlgoPerf suite of workloads with respect to NadamW.
  • ...and 34 more figures