Network Fission Ensembles for Low-Cost Self-Ensembles

Hojung Lee, Jong-Seok Lee

TL;DR

This work tackles the high cost of ensemble methods by introducing Network Fission Ensembles (NFE), which convert a single network into a multi-exit architecture through weight pruning and grouping, enabling ensemble predictions without additional models. During training, NFE uses ensemble knowledge distillation: the outputs of all $N$ exits act as a joint teacher, with ensemble logits $z_E = \frac{1}{N}\sum_{i=1}^{N} z_i$ and teacher probabilities $q_E = \text{softmax}(z_E / T)$, where $z_i$ are the logits of exit $i$ and $T$ is the distillation temperature. On CIFAR-100 and Tiny ImageNet with ResNet and Wide-ResNet backbones, the method achieves higher accuracy than Deep Ensembles and other low-cost ensembles while keeping FLOPs close to those of a single model; results remain robust under moderate sparsity from pruning-at-initialization (PaI) methods and with balanced weight grouping. Overall, NFE offers a practical route to ensemble-level accuracy at near-zero additional computational cost, with potential extensions to other computer vision tasks and further optimizations for exit scalability.
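
The ensemble teacher described above can be illustrated with a small, framework-free sketch. Only the formulas for $z_E$ and $q_E$ come from the summary; the helper names (`softmax`, `ensemble_teacher`, `distillation_losses`), the example logits, and the choice of a KL divergence between the teacher and each exit's softened prediction are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher(exit_logits, temperature):
    """Return z_E (mean of exit logits) and q_E = softmax(z_E / T)."""
    n = len(exit_logits)
    num_classes = len(exit_logits[0])
    z_e = [sum(z[c] for z in exit_logits) / n for c in range(num_classes)]
    return z_e, softmax(z_e, temperature)


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for two discrete distributions."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p))


def distillation_losses(exit_logits, temperature):
    """Assumed distillation term: KL(q_E || p_i) for every exit i."""
    _, q_e = ensemble_teacher(exit_logits, temperature)
    return [kl_divergence(q_e, softmax(z, temperature)) for z in exit_logits]


# Hypothetical logits from N = 3 exits over 3 classes.
exit_logits = [[2.0, 0.5, -1.0], [1.0, 1.5, -0.5], [3.0, 0.0, 0.0]]
z_e, q_e = ensemble_teacher(exit_logits, temperature=3.0)
losses = distillation_losses(exit_logits, temperature=3.0)
```

In a full training loop, a term like this would presumably be combined with a per-exit cross-entropy loss; the weighting between the two is not stated in this summary.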

Abstract

Recent ensemble learning methods for image classification have been shown to improve classification accuracy with low extra cost. However, they still require multiple trained models for ensemble inference, which eventually becomes a significant burden when the model size increases. In this paper, we propose a low-cost ensemble learning and inference method, called Network Fission Ensembles (NFE), which converts a conventional network itself into a multi-exit structure. Starting from a given initial network, we first prune some of the weights to reduce the training burden. We then group the remaining weights into several sets and create multiple auxiliary paths using each set to construct multi-exits. We call this process Network Fission. Through this, multiple outputs can be obtained from a single network, which enables ensemble learning. Since this process simply changes the existing network structure to multi-exits without using additional networks, there is no extra computational burden for ensemble learning and inference. Moreover, by learning from multiple losses of all exits, the multi-exits improve performance via regularization, and high performance can be achieved even with increased network sparsity. With our simple yet effective method, we achieve significant improvement compared to existing ensemble methods. The code is available at https://github.com/hjdw2/NFE.
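
The prune-then-group step can be sketched as building disjoint binary masks over one flattened weight tensor. This is a rough illustration under stated assumptions: random pruning stands in for a PaI criterion, round-robin assignment stands in for the paper's grouping rule, and the function name `fission_masks` and its parameters are hypothetical. See the authors' repository for the actual procedure.

```python
import random


def fission_masks(num_weights, sparsity, num_exits, seed=0):
    """Prune a fraction of weights, then split the rest into disjoint groups.

    Returns one 0/1 mask per exit; a 1 marks a weight assigned to that group.
    Random pruning and round-robin grouping are illustrative assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(num_weights))
    rng.shuffle(indices)

    # Keep the first (1 - sparsity) share of the shuffled indices.
    num_kept = num_weights - int(round(sparsity * num_weights))
    kept = indices[:num_kept]

    # Assign kept weights to groups in turn so group sizes stay balanced.
    masks = [[0] * num_weights for _ in range(num_exits)]
    for position, weight_index in enumerate(kept):
        masks[position % num_exits][weight_index] = 1
    return masks


# 12 weights, 25% pruned, N = 3 groups: 9 weights remain, 3 per group.
masks = fission_masks(num_weights=12, sparsity=0.25, num_exits=3)
group_sizes = [sum(mask) for mask in masks]
```

Because the masks only partition weights that already exist, the total parameter count never exceeds that of the original network, which matches the abstract's claim that no additional networks are introduced.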

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration comparing common weight pruning with the proposed Network Fission process. The small boxes represent individual weight parameters. We group the weights of layers into several sets and create multiple auxiliary classifier paths (exits) to construct multi-exits. This Network Fission enables us to obtain multiple outputs with a single network for ensemble learning at almost zero cost.
  • Figure 2: Illustration of the proposed Network Fission Ensembles process. First, if the model is very large, PaI can be applied to reduce its size. Next, weight grouping is performed considering the number of ensemble members (exits) to use ($N=3$ here). Then, the network is transformed into a multi-exit structure using the grouped weights. For training, the ensemble of outputs from all exits is used as the distillation teacher to guide the learning process. Since only the computation process is changed without any explicit structural changes, no additional training and inference burden is incurred. In the illustration, we use the term 'layer' in order to show an example with a simple network structure. But for popular network structures (e.g., ResNet) composed of multiple stages (each of which consists of multiple layers), weight grouping is performed in each stage rather than in each layer separately.
  • Figure 3: Ensemble accuracy (%) vs. inference FLOPs on CIFAR-100 with Wide-ResNet28-10.
  • Figure 4: Test accuracy (%) of different PaI methods with ResNet18 on CIFAR-100 as a function of sparsity.
  • Figure 5: Test accuracy (%) for different weight grouping ratios with ResNet18 on CIFAR-100.