NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling

Bram Grooten, Farid Hasanov, Chenxiang Zhang, Qiao Xiao, Boqian Wu, Zahra Atashgahi, Ghada Sokar, Shiwei Liu, Lu Yin, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu

TL;DR

NeuroTrails tackles the high computational cost of ensembles by splitting a network into a shared backbone and multiple dynamically sparsified heads that are periodically reshaped during training. By using dynamic sparse training and a layer/block-based architecture split, it creates diverse predictive trails while preserving efficient inference via soft voting across heads. Empirical results across CV and language tasks show improved accuracy and robustness with substantially fewer parameters and FLOPs, including real-time CPU speedups using DeepSparse and strong zero-shot generalization on downstream tasks. The work identifies a Goldilocks region of prediction diversity that maximizes ensemble benefit, offering a practical, model-agnostic approach to efficient ensembling for vision and language models.
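The inference path described above (one shared backbone pass, several heads, soft voting) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the names `softmax`, `soft_vote`, `predict`, and the toy `backbone` and `heads` callables are all hypothetical, and soft voting is assumed here to mean averaging the per-head class probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(head_logits):
    """Average the per-head class probabilities (assumed form of soft voting)."""
    probs = [softmax(logits) for logits in head_logits]
    n_heads, n_classes = len(probs), len(probs[0])
    return [sum(p[c] for p in probs) / n_heads for c in range(n_classes)]

def predict(x, backbone, heads):
    """Run the shared backbone once, then every head on the same features."""
    features = backbone(x)
    avg = soft_vote([head(features) for head in heads])
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg

# Toy stand-ins for the shared backbone and three heads (hypothetical).
backbone = lambda x: [2.0 * v for v in x]
heads = [
    lambda f: [f[0], f[1], 0.0],
    lambda f: [f[0], 0.0, f[1]],
    lambda f: [0.0, f[1], f[0]],
]

label, avg = predict([1.0, 0.5], backbone, heads)
```

Because the backbone output is computed once and reused by every head, the cost of adding a head is only the cost of that head, which is the source of the savings relative to running several full networks.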

Abstract

Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand considerable compute at inference. In response to these limitations, we introduce $\textbf{NeuroTrails}$, a sparse multi-head architecture with dynamically evolving topology. This unexplored model-agnostic training paradigm improves ensemble performance while reducing the required resources. We analyze the underlying reason for its effectiveness and observe that the various neural trails induced by dynamic sparsity attain a $\textit{Goldilocks zone}$ of prediction diversity. NeuroTrails displays efficacy with convolutional and transformer-based architectures on computer vision and language tasks. Experiments on ResNet-50/ImageNet, LLaMA-350M/C4, among many others, demonstrate increased accuracy and stronger robustness in zero-shot generalization, while requiring significantly fewer parameters.
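The abstract's notion of prediction diversity can be made concrete with a simple disagreement measure. The sketch below assumes diversity is measured as the fraction of samples on which two heads predict different labels, averaged over all head pairs; this is one common definition, and the excerpt here does not state the exact metric the paper uses. The function names and the toy predictions are hypothetical.

```python
from itertools import combinations

def pairwise_disagreement(preds_a, preds_b):
    """Fraction of samples on which two heads predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both heads must predict on the same samples")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def mean_disagreement(head_preds):
    """Average pairwise disagreement over all pairs of heads."""
    pairs = list(combinations(head_preds, 2))
    return sum(pairwise_disagreement(a, b) for a, b in pairs) / len(pairs)

# Toy predicted labels from three heads on four samples (made up for illustration).
head_preds = [
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
]
diversity = mean_disagreement(head_preds)
```

A value of 0 means all heads always agree (no ensemble benefit), while a value near 1 means the heads rarely agree; the Goldilocks claim is that the best ensembles sit between these extremes.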

Paper Structure

This paper contains 54 sections, 5 equations, 10 figures, 17 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of NeuroTrails. We divide a network into a shared backbone $\mathcal{F}_s$ and multiple independent heads $\mathcal{F}_h$. Weights are initially pruned at random to a target sparsity ratio. Finally, the network topology is repeatedly refined through dynamic sparse training. The resulting sparse multi-head architecture achieves better performance than a full ensemble while using fewer resources.
  • Figure 2: Testing zero-shot generalization ability on corrupted ImageNet samples and out-of-domain sketches. NeuroTrails outperforms a full ensemble in robustness, despite requiring a fraction of its FLOPs.
  • Figure 3: Performance of NeuroTrails models with varying backbone sizes and sparsification methods (CIFAR-100 with Wide-ResNet28-10). Backbone Length: The backbone length that best balances accuracy and efficiency appears to be around 1/3 of the network, meaning 8 of the 12 blocks are in each head. Sparsification: The dynamic sparse training (DST) algorithms RigL and SET demonstrate superior performance, indicating that DST is the most effective of the sparsification approaches compared.
  • Figure 4: Example of a CIFAR-100 test-set image where too much prediction diversity between heads degrades performance. NeuroTrails with 8 blocks in each head seems to get the amount of diversity just right. For more illustrations of predictions with overly large diversity, see \ref{sec:goldilocks}.
  • Figure 5: Accuracy and Prediction Disagreement throughout training for a NeuroTrails model on CIFAR-100, showing an inverse trend.
  • ...and 5 more figures
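The two sparsification steps named in the Figure 1 caption (random pruning to a target sparsity, then repeated topology refinement) can be sketched as below. This is a SET-style illustration under stated assumptions, not the paper's algorithm: weights are a flat list, the update drops the active weights closest to zero and regrows the same number at random inactive positions, and regrown weights start at zero. The names `random_mask`, `prune_and_regrow`, and `drop_fraction` are hypothetical. In a multi-head setting, each head would presumably hold its own mask.

```python
import random

def random_mask(n_weights, sparsity, rng):
    """0/1 mask with round(sparsity * n_weights) zeros at random positions."""
    n_pruned = round(sparsity * n_weights)
    pruned = set(rng.sample(range(n_weights), n_pruned))
    return [0 if i in pruned else 1 for i in range(n_weights)]

def prune_and_regrow(weights, mask, drop_fraction, rng):
    """One SET-style topology update; returns new (weights, mask)."""
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    n_drop = min(int(drop_fraction * len(active)), len(inactive))
    # Prune: remove the active connections whose weights are closest to zero.
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:n_drop]
    # Regrow: activate the same number of previously inactive connections at random.
    grown = rng.sample(inactive, n_drop)
    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_weights[i], new_mask[i] = 0.0, 0
    for i in grown:
        new_weights[i], new_mask[i] = 0.0, 1  # assumption: new connections start at zero
    return new_weights, new_mask

rng = random.Random(0)
mask = random_mask(20, 0.8, rng)               # 80% sparsity: 4 of 20 weights active
weights = [rng.gauss(0.0, 1.0) * m for m in mask]
new_weights, new_mask = prune_and_regrow(weights, mask, 0.5, rng)
```

The update keeps the number of active connections constant, so the sparsity ratio set at initialization is preserved while the connectivity pattern changes. RigL differs in the regrow step, choosing new connections by gradient magnitude rather than at random.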