PopulAtion Parameter Averaging (PAPA)

Alexia Jolicoeur-Martineau; Emy Gervais; Kilian Fatras; Yan Zhang; Simon Lacoste-Julien

TL;DR

PAPA narrows the gap between full ensembling and single-model inference by training a population of diverse networks and gently nudging their weights toward the population mean using an EMA. By combining frequent weight averaging during training with model soups at the end of training, PAPA achieves ensemble-like generalization at a far lower inference cost. Compared with training independent (non-averaged) models, it raises the average accuracy of the population by up to 0.8% on CIFAR-10, 1.9% on CIFAR-100, and 1.6% on ImageNet, and it is compatible with SWA. Overall, PAPA offers a scalable, parallelizable alternative to ensembling that preserves diversity and improves generalization in deep networks while enabling single-model deployment when needed.

Abstract

Ensemble methods combine the predictions of multiple models to improve performance, but they require significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights. However, this usually performs significantly worse than ensembling. Weight averaging is only beneficial when the networks are different enough to benefit from being combined, but similar enough to average well. Based on this idea, we propose PopulAtion Parameter Averaging (PAPA): a method that combines the generality of ensembling with the efficiency of weight averaging. PAPA leverages a population of diverse models (trained on different data orders, augmentations, and regularizations) while slowly pushing the weights of the networks toward the population average of the weights. We also propose PAPA variants (PAPA-all and PAPA-2) that average weights rarely rather than continuously; all methods increase generalization, but PAPA tends to perform best. PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 0.8% on CIFAR-10, 1.9% on CIFAR-100, and 1.6% on ImageNet when compared to training independent (non-averaged) models.
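The "slow push toward the population average" described in the abstract can be illustrated with a minimal sketch. It assumes each network's weights are flattened into a plain list of floats and that the push is a linear interpolation with a rate `alpha`; the function names, the value of `alpha`, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
from typing import List

Weights = List[float]  # hypothetical: one flattened weight vector per network


def population_mean(population: List[Weights]) -> Weights:
    """Element-wise mean of the weight vectors (the population average)."""
    p = len(population)
    return [sum(ws) / p for ws in zip(*population)]


def push_toward_mean(population: List[Weights], alpha: float) -> List[Weights]:
    """Interpolate every network toward the population average.

    theta_j <- alpha * theta_j + (1 - alpha) * theta_bar
    alpha close to 1 is a slow push; alpha = 0 replaces each network by the mean.
    (The interpolation form and alpha are assumptions made for this sketch.)
    """
    mean = population_mean(population)
    return [
        [alpha * w + (1.0 - alpha) * m for w, m in zip(theta, mean)]
        for theta in population
    ]


# Toy population of p = 3 networks with 2 weights each (made-up numbers).
population = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# In real training this push would be interleaved with gradient steps.
pushed = push_toward_mean(population, alpha=0.5)

# After training, the weights are averaged to obtain a single network.
final = population_mean(pushed)
```

In this toy example the push leaves the population mean unchanged and shrinks each network's distance to it by a factor of `alpha`, so the networks stay distinct but move closer together before the final average.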

Paper Structure (53 sections, 3 equations, 2 figures, 18 tables, 5 algorithms)

Introduction
PopulAtion Parameter Averaging (PAPA)
Training a population of networks by pushing toward the average (PAPA)
Special cases of PAPA when averaging rarely instead of frequently (PAPA-all & PAPA-2)
Handling changes in learning rates
Inference with the population
Model soups
Related work
Concurrent work
Federated learning and averaging over different data partitions
Distributed Consensus Optimization
Genetic algorithms
Averaging in optimization
Permutation-matching and mode connectivity
Experiments
...and 38 more sections

Figures (2)

Figure 1: Illustration of PAPA. Multiple networks (with weights $\theta_j$) are trained on slight variations of the dataset. Every few (10) iterations, the weights are pushed slightly toward the population average of the weights $\bar{\theta}=\frac{1}{p}\sum_{j=1}^p \theta_j$. After training, the weights are averaged to get a single network.
Figure 2: Accuracy (and its change after averaging) at each epoch with PAPA variants on CIFAR-100.

Theorems & Definitions (1)

Definition 2.1
