Neural Subnetwork Ensembles

Tim Whitaker

TL;DR

This work proposes Subnetwork Ensembles, a low-cost framework for building neural network ensembles by sampling, perturbing, and optimizing subnetworks from a trained parent model. It formalizes three perturbation families (Noisy, Sparse, and Stochastic) and introduces Neural Partitioning to maximize diversity while reducing parameter overlap. Across ImageNet, CIFAR, and ProcGen benchmarks, the approach achieves consistent generalization gains while reducing training cost and parameter usage, with the sparse and stochastic variants providing further robustness and scalability. The framework supports dynamic ensemble growth, can start from pre-trained models, and analyzes diversity through both output metrics and interpretability-based representations, which suggests practical value for efficient, robust ensemble learning in large-scale deep networks.

Abstract

Neural network ensembles have been effectively used to improve generalization by combining the predictions of multiple independently trained models. However, the growing scale and complexity of deep neural networks have made these methods prohibitively expensive and time-consuming to implement. Low-cost ensemble methods have become increasingly important because they alleviate the need to train multiple models from scratch while retaining the generalization benefits that traditional ensemble learning affords. This dissertation introduces and formalizes a low-cost framework for constructing Subnetwork Ensembles, where a collection of child networks is formed by sampling, perturbing, and optimizing subnetworks from a trained parent model. We explore several distinct methodologies for generating child networks and evaluate their efficacy through a variety of ablation studies and established benchmarks. Our findings reveal that this approach can greatly improve training efficiency, parametric utilization, and generalization performance while minimizing computational cost. Subnetwork Ensembles offer a compelling framework for exploring how we can build better systems by leveraging the unrealized potential of deep neural networks.
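
As a rough illustration of the idea of forming child networks from a trained parent, the sketch below applies the three perturbation families named in the TL;DR to a toy linear model stored as a flat weight list. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the Gaussian noise model, the random (rather than learned or magnitude-based) masks, and the simple prediction averaging are all hypothetical choices made for illustration, and the fine-tuning ("optimizing") step for each child is omitted.

```python
import random

def predict(weights, x):
    """Toy linear model: dot product of weights and inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

def noisy_child(parent, sigma, rng):
    """Noisy (assumed form): add Gaussian noise to every parent weight."""
    return [w + rng.gauss(0.0, sigma) for w in parent]

def sparse_child(parent, keep_prob, rng):
    """Sparse (assumed form): zero out weights using a fixed random mask."""
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in parent]
    return [w * m for w, m in zip(parent, mask)]

def stochastic_forward(parent, scores, x, rng):
    """Stochastic (assumed form): resample the active weights on every
    forward pass, keeping weight i with probability scores[i]."""
    active = [w if rng.random() < p else 0.0 for w, p in zip(parent, scores)]
    return predict(active, x)

def ensemble_predict(children, x):
    """Combine child networks by averaging their predictions."""
    return sum(predict(c, x) for c in children) / len(children)

rng = random.Random(0)
parent = [0.5, -1.0, 2.0, 0.25]          # stands in for trained parent weights
x = [1.0, 2.0, 3.0, 4.0]

children = [sparse_child(parent, keep_prob=0.5, rng=rng) for _ in range(4)]
children += [noisy_child(parent, sigma=0.1, rng=rng) for _ in range(4)]
print(ensemble_predict(children, x))
```

In a real setting each child would be fine-tuned for a short time after perturbation, which is where the training-cost savings relative to training every member from scratch would come from.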

Paper Structure (67 sections, 34 equations, 21 figures, 10 tables, 3 algorithms)

Figures (21)

  • Figure 1: A diagram of the McCulloch and Pitts computational model of a neuron used in the Perceptron. The neuron computes a weighted sum of the inputs and then passes the output through a binary step function for classification.
  • Figure 2: Visualizations of how convolutional layers operate on input images. Each convolutional filter in a layer scans over the input image according to a stride length. The middle image displays how a 7x7 filter with a stride of 7 applies to the input image. The rightmost image displays a 7x7 filter with overlap using a stride of 2.
  • Figure 3: A visualization of child networks formed through three distinct methodologies that we investigate in Chapters 4, 5, and 6. Noisy Subnetwork Ensembles perturb subnetworks with noise. Sparse Subnetwork Ensembles prune subnetworks from the parent. Stochastic Subnetwork Ensembles use probability scores to determine active parameters on every pass through the network.
  • Figure 4: Neuroevolution excels at fine-tuning fully trained networks that get stuck in flat loss basins. The left graph displays a typical SGD training trajectory, where models tend to converge to the edges of flat optima (Izmailov et al., 2019). The middle graph displays, as white dots, child networks generated from sparse mutations. The right graph shows the final ensemble, consisting of the top candidates selected by evaluation on a separate validation set. A key insight behind the success of gradient-free methods for fine-tuning is that the loss landscape of the test distribution rarely matches that of the training distribution exactly. Ensembling over wide areas of good validation performance is key to improving generalization.
  • Figure 5: The images above detail decision boundaries for a trained three-layer multilayer perceptron after being perturbed with various mutations. With dense perturbations, there is a comparatively small window between a functional decision boundary and complete performance collapse. As mutations become sparser, the model is better able to retain its behavior as the strength of parameter mutation increases.
  • ...and 16 more figures
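
The captions for Figures 1 and 2 describe two small computations that can be sketched directly: a neuron that thresholds a weighted sum, and a filter sliding over an image with a given stride. The code below is an illustrative sketch only. The function names, the `>=` threshold convention, the 28-pixel input width, and the no-padding assumption are hypothetical choices not taken from the source.

```python
def step_neuron(inputs, weights, threshold=0.0):
    """McCulloch-Pitts style neuron: weighted sum followed by a binary step.
    Assumed convention: fire (return 1) when the sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def conv_output_size(input_size, kernel_size, stride):
    """Number of filter positions along one axis, assuming no padding."""
    return (input_size - kernel_size) // stride + 1

# A two-input neuron acting as a logical AND (weights and threshold chosen by hand).
print(step_neuron([1, 1], [1.0, 1.0], threshold=2.0))  # 1
print(step_neuron([1, 0], [1.0, 1.0], threshold=2.0))  # 0

# A 7x7 filter on a hypothetical 28-pixel-wide input:
print(conv_output_size(28, 7, 7))  # 4 non-overlapping positions
print(conv_output_size(28, 7, 2))  # 11 overlapping positions
```

With a stride equal to the filter width the patches tile the input without overlap, while a smaller stride makes neighboring patches share pixels and yields more output positions.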