On Defining Neural Averaging

Su Hyeong Lee, Richard Ngo

TL;DR

This work tackles how to define a principled neural average when training data is unavailable, proposing Amortized Model Ensembling (AME) as a data-free meta-optimization in weight space. AME treats differences between pretrained ingredients as pseudogradients and uses adaptive optimization to fuse them, recovering model soup as a special case while enabling more expressive ensembling. Empirically, AME improves out-of-distribution generalization over individual experts and soups, and reveals zero-data training-like benefits across vision transformers and synthetic CIFAR-100 experiments. The framework connects optimization dynamics with weight-space aggregation, offering a versatile tool for federated, privacy-preserving, and domain-heterogeneous settings where data access is restricted.

Abstract

What does it even mean to average neural networks? We investigate the problem of synthesizing a single neural network from a collection of pretrained models, each trained on disjoint data shards, using only their final weights and no access to training data. In forming a definition of neural averaging, we take insight from model soup, which appears to aggregate multiple models into a single model while enhancing generalization performance. In this work, we reinterpret model souping as a special case of a broader framework: Amortized Model Ensembling (AME) for neural averaging, a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates. We show that this perspective not only recovers model soup but also enables more expressive and adaptive ensembling strategies. Empirically, AME produces averaged neural solutions that outperform both individual experts and model soup baselines, especially in out-of-distribution settings. Our results suggest a principled and generalizable notion of data-free model weight aggregation and define, in one sense, how to perform neural averaging.
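To make the "model differences as pseudogradients" idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the pseudogradient for an ingredient is simply the current weights minus that ingredient's weights, and it uses flat NumPy vectors as stand-ins for model weights. The names `pseudogradient` and `gd_ame_one_pass` are hypothetical. Under these assumptions, one gradient-descent pass with step size 1/(i+1) reproduces the uniform model soup.

```python
import numpy as np

def pseudogradient(theta, ingredient):
    """Assumed pseudogradient: current weights minus one ingredient's weights."""
    return theta - ingredient

def gd_ame_one_pass(ingredients):
    """One data-free gradient-descent pass over the ingredients.

    With step size 1/(i+1), each update turns theta into the running mean
    of the ingredients seen so far, so the final theta is the uniform soup.
    """
    # Hypothetical initialization; the first step (eta = 1) overwrites it.
    theta = np.zeros_like(ingredients[0])
    for i, w in enumerate(ingredients):
        eta = 1.0 / (i + 1)
        theta = theta - eta * pseudogradient(theta, w)
    return theta

# Toy "models": four random weight vectors of dimension 5.
rng = np.random.default_rng(0)
ingredients = [rng.normal(size=5) for _ in range(4)]

soup = np.mean(ingredients, axis=0)   # uniform model soup
ame = gd_ame_one_pass(ingredients)    # GD instantiation of the sketch
```

Swapping the plain gradient step for an adaptive update rule (for example Adagrad or AdamW) while keeping the same pseudogradients is what would give the more expressive ensembling strategies described above.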

Paper Structure

This paper contains 43 sections, 2 theorems, 20 equations, 17 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

(Informal) There exist a learning rate schedule $\eta_i$ and an adaptivity parameter schedule $\varepsilon_i$ such that the Adagrad ensemble $\approx$ the GD ensemble.
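One standard way such schedules can arise is sketched below. This is an illustration under stated assumptions, not the paper's proof. It assumes the common Adagrad update $\theta \leftarrow \theta - \eta_i\, g / (\sqrt{G} + \varepsilon_i)$ with $\varepsilon_i$ added outside the square root, and the pseudogradient $g = \theta - w_i$. When $\varepsilon_i$ is much larger than $\sqrt{G}$, the update is close to a GD step with learning rate $\eta_i / \varepsilon_i$. The function names are hypothetical.

```python
import numpy as np

def gd_ensemble(ingredients, etas):
    """Gradient-descent ensembling with pseudogradient theta - w."""
    theta = np.zeros_like(ingredients[0])
    for w, eta in zip(ingredients, etas):
        theta = theta - eta * (theta - w)
    return theta

def adagrad_ensemble(ingredients, etas, epsilons):
    """Adagrad ensembling; eps is added outside the square root (assumed form)."""
    theta = np.zeros_like(ingredients[0])
    accum = np.zeros_like(theta)  # running sum of squared pseudogradients
    for w, eta, eps in zip(ingredients, etas, epsilons):
        g = theta - w
        accum = accum + g ** 2
        theta = theta - eta * g / (np.sqrt(accum) + eps)
    return theta

ingredients = [np.array([1.0, -2.0]), np.array([3.0, 0.5]), np.array([-1.0, 4.0])]

# GD schedule 1/(i+1) yields the running mean, i.e. the uniform soup.
gd_etas = [1.0 / (i + 1) for i in range(len(ingredients))]

# Large eps swamps sqrt(accum); scaling eta by eps recovers the GD step size.
epsilons = [1e6] * len(ingredients)
ada_etas = [eta * eps for eta, eps in zip(gd_etas, epsilons)]

gd_result = gd_ensemble(ingredients, gd_etas)
ada_result = adagrad_ensemble(ingredients, ada_etas, epsilons)
gap = np.max(np.abs(gd_result - ada_result))  # small when eps >> sqrt(accum)
```

Shrinking `epsilons` toward zero lets the accumulated squared pseudogradients dominate the denominator, which is the regime where the Adagrad ensemble would be expected to depart from the GD ensemble.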

Figures (17)

  • Figure 1: Performance of GD, Adagrad, AdamW, and Adadelta amortized ensembling (from left to right) of ViT-S trained on GLD-23K, after 1 ensembling epoch. No training or testing data of any kind is used during the ensembling process (data-free); only trained model weights are accessed. Plots depict classification accuracy on GLD-23K testing data unseen during training, for a variety of optimizer hyperparameter choices (described in Appendix \ref{SetupandDatasetAppendix}). The horizontal axis is the number of model training epochs, which is distinct from ensembling epochs, and the color transition indicates the number of ensemble ingredients (models aggregated), ranging from 2 (dark) to 16 (light). At a macroscopic level, each optimizer instantiation induces qualitatively different ensemble accuracy dynamics, affirming the existence of diverse ensembling strategies uncovered by the AME framework. Full details are given in Appendix \ref{ExperimentSetupAppendix}.
  • Figure 2: Ensemble test accuracy of ViT-S fine-tuned on the centralized GLD-23K dataset, over the hyperparameter sweep and experiment setup detailed in Appendix \ref{ExperimentSetupAppendix}. Each continuous line depicts a single AME hyperparameter configuration, where an ensemble is formed at each x-axis timestep (training epochs 0 to 24). The title states the number of ensemble epochs, where each model ingredient is treated as a datapoint. AME was instantiated with gradient descent, and batching used two ingredient models per batch. As the number of ensemble epochs increases from 1 to 5 to 15, ensemble performance improves and stabilizes, showing that the effect of zero-data model training can be obtained simply by running more meta-optimization epochs. Batching and shuffling the networks being fused further enhances ensemble performance, and adding more high-performing model ingredients also benefits test accuracy. Additional results are contained in Appendix \ref{GDEnsembleAppendix}.
  • Figure 3: The first column displays uniformly averaged (souped) images from each CIFAR-100 class's training set. Subsequent columns show analogous outputs from AME, applied to 500 randomly initialized image tensors per class. Image synthesis was performed using AdamW-based AME with 40 optimization sweeps. The classifier, a ViT fine-tuned solely on the original CIFAR-100 dataset, was never exposed to the synthesized data. Percentages indicate the proportion of synthesized images per class correctly classified by the ViT (out of 500), where random-guessing accuracy is 1%. For further visualizations and AME hyperparameter details, see Appendix \ref{Neural_Averaging_Images}.
  • Figure 4: Each panel compares AME ensembles (Adam or GD) against model soup using identical data samples per trial. In (a-b), we verify that under the heavy-tailed Cauchy distribution, Adam-AME outperforms GD-AME. For a non-heavy-tailed Gaussian (c), GD-AME ensembles (red) align closely with the soups (blue) centered around the green MLE, and are therefore visually occluded. Full details and hyperparameter settings are described in Appendix \ref{StatisticalEstimators}.
  • Figure 5: ViT-S ensemble performance on GLD-23K training data (training accuracy), where the x-axis of each plot is the training epoch. The title states the number of ensemble epochs used at each training epoch, which ranges from $0$ to $24$. The y-axis is set to $(0, 1)$. Compared to Figure \ref{example_val_figure}, which gives analogous results for a held-out validation set, the best-performing ensembles reach accuracy near $1$. Additional ensemble epochs, as well as batching and shuffling, substantially improve ensemble performance. All batching in this section used two ingredient models per batch.
  • ...and 12 more figures
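
The multi-epoch procedure described in the Figure 2 caption (ingredients treated as datapoints, gradient descent, two ingredients per batch, optional shuffling) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code. The function `gd_ame`, the choice of initialization, the learning rate, and the batch-averaged pseudogradient are all assumptions, and flat NumPy vectors stand in for model weights.

```python
import numpy as np

def gd_ame(ingredients, init, lr=0.5, ensemble_epochs=5,
           batch_size=2, shuffle=True, seed=0):
    """Data-free GD over ingredient weights, treating each ingredient as a datapoint."""
    rng = np.random.default_rng(seed)
    stacked = np.stack(ingredients)       # shape: (num_ingredients, dim)
    theta = np.array(init, dtype=float)   # copy so the caller's array is untouched
    n = len(stacked)
    for _ in range(ensemble_epochs):
        # Optionally reshuffle the ingredient order at every ensemble epoch.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            batch = stacked[order[start:start + batch_size]]
            # Assumed pseudogradient: mean difference between theta and the batch.
            grad = np.mean(theta - batch, axis=0)
            theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
ingredients = [rng.normal(size=6) for _ in range(8)]

# Hypothetical setup: start from the first ingredient, 15 ensemble epochs.
fused = gd_ame(ingredients, init=ingredients[0], ensemble_epochs=15)

# Sanity check: one full-batch step with lr = 1 gives the uniform soup.
soup_check = gd_ame(ingredients, init=ingredients[0], lr=1.0,
                    ensemble_epochs=1, batch_size=len(ingredients), shuffle=False)
```

With a learning rate in (0, 1], each step is a convex combination of the current weights and the batch mean, so in this sketch the fused weights stay within the coordinate-wise range of the ingredients and the initialization.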

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof