WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Louis Fournier, Adel Nabli, Masih Aminbeidokhti, Marco Pedersoli, Eugene Belilovsky, Edouard Oyallon

TL;DR

This work tackles the accuracy-inference-cost trade-off of model ensembling by introducing WASH, a communication-efficient training scheme that enables weight averaging of parallel models through parameter shuffling. By randomly permuting a small fraction of parameters across models with layer-aware adaptation, WASH preserves diversity while keeping models near a consensus, allowing high-performing averaged models with far lower communication than prior EMA-based methods. Empirically, WASH achieves state-of-the-art results on image classification tasks, with averaged models approaching ensemble performance and substantially reduced inference cost; ablations highlight the importance of early-layer shuffling and modest shuffling probabilities. The approach has practical implications for scalable, resource-efficient deployment of ensemble-like models in real-world settings, and code is released for reproducibility.

Abstract

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.
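The shuffle-then-average idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' released code: the names `wash_shuffle` and `average_models`, the flat `(n_models, n_params)` layout, and the single shuffling probability `p` are assumptions made for illustration, and the sketch ignores both the layer-wise adaptation of the probability and any actual communication between devices.

```python
import numpy as np

def wash_shuffle(params, p, rng):
    """Shuffle a random subset of parameter positions across models.

    params: array of shape (n_models, n_params), one flattened model per row
            (hypothetical layout chosen for this sketch).
    p:      probability that a given parameter position is shuffled.
    rng:    a numpy.random.Generator.
    """
    n_models, n_params = params.shape
    out = params.copy()
    # Select the parameter positions to shuffle at this step.
    selected = np.flatnonzero(rng.random(n_params) < p)
    for j in selected:
        # Permute the values of position j across the population of models.
        out[:, j] = params[rng.permutation(n_models), j]
    return out

def average_models(params):
    """Uniform weight average of the population (one row per model)."""
    return params.mean(axis=0)

# Usage: 5 toy "models" with 1000 parameters each, 5% of positions shuffled.
rng = np.random.default_rng(0)
population = rng.normal(size=(5, 1000))
shuffled = wash_shuffle(population, p=0.05, rng=rng)
averaged = average_models(shuffled)
```

Because each selected position is only permuted across models, the shuffle leaves the averaged model unchanged. In a training loop it would be interleaved with independent optimizer steps on each model, and only the selected positions would need to be exchanged.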

Paper Structure (29 sections, 8 equations, 6 figures, 4 tables, 1 algorithm)


Figures (6)

  • Figure 1: Representation of training with WASH. A population of models is being trained separately. (1) After each training step, (2) a small percentage of parameters are permuted between models. (3) At the end of the training, the model weights are averaged, resulting in a high-performance model.
  • Figure 2: Average distance to the consensus (i.e. the averaged model) during training for a heterogeneous population of $5$ models trained on CIFAR-100, either separately, with PAPA, PAPA-all, or our method WASH. Starting at consensus, models initially diverge from each other before moving back toward each other late in training, mainly due to weight decay. Models trained with WASH have a smaller distance to consensus than ones trained separately, allowing them to be averaged with no performance loss. When training with PAPA-all (i.e. averaging to a single model every few epochs), models cannot reach the same diversity as with WASH between these averaging steps. Finally, the EMA of PAPA has a strong pulling effect towards consensus, resulting in a distance similar to that of PAPA-all. The jitter in the curve is due to the immediate distance reduction caused by the EMA steps.
  • Figure 3: 2D optimization example. We train 2 points with SGD on a simple loss function with 2 local and 1 global minima (upwards and downwards triangle). The two models are trained from two different starting points (plus signs). If the points are trained separately (yellow), they converge to their closest local minimum (yellow circles). By training with PAPA (blue), the points reach a consensus but then converge to one of the local minima (blue circles). By training with WASH (red), the shuffling (seen by the horizontal and vertical lines in the trajectory) allows more diversity in the optimization path, and the points both reach the global minimum (red circles).
  • Figure 4: Average distance to the consensus for different layer-wise adaptations of WASH, for different slices of the model's parameters. Keeping the probability constant across layers ensures the lowest distance to consensus for the first quarters. Surprisingly, in the last quarter of parameters, despite initially starting with a higher distance to consensus, the 'decreasing probability' schedule shows a lower distance to consensus later in training, even though shuffling is less frequent than with the other schedules. The 'increasing probability' schedule shows how sensitive early layers are to shuffling.
  • Figure 5: Ablations of WASH
  • ...and 1 more figure
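The "average distance to the consensus" plotted in Figures 2 and 4 can be sketched as follows. The captions do not state which norm is used, so the Euclidean norm here is an assumption, as are the function name `avg_distance_to_consensus` and the flat `(n_models, n_params)` layout.

```python
import numpy as np

def avg_distance_to_consensus(params):
    """Mean Euclidean distance between each model and the averaged model.

    params: array of shape (n_models, n_params), one flattened model per row
            (hypothetical layout; the L2 norm is an assumed choice of metric).
    """
    consensus = params.mean(axis=0)                       # the averaged model
    distances = np.linalg.norm(params - consensus, axis=1)  # one per model
    return float(distances.mean())

# Usage: two toy models on opposite sides of their average.
toy = np.array([[0.0, 0.0], [2.0, 0.0]])
d = avg_distance_to_consensus(toy)
```

Restricting `params` to a slice of columns would give a per-slice version of the same quantity, in the spirit of the parameter quarters shown in Figure 4.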