Efficient Adaptive Ensembling for Image Classification

Antonio Bruno, Davide Moroni, Massimo Martinelli

TL;DR

This work tackles the problem of improving image classification accuracy without large increases in model complexity. It introduces Efficient Adaptive Ensembling, which trains two EfficientNet-b0 weak learners on disjoint subsets (bagging) and fuses their representations with a trainable feature-level combiner, resulting in improved accuracy with far fewer parameters and FLOPs. The method achieves an average accuracy gain of $0.5\%$ while reducing parameters by $5$–$60$ times and FLOPs by $10$–$100$ times, demonstrating a practical path to greener, faster high-performance image classification. The approach also provides a framework for extending ensembling to other CV tasks like detection and segmentation, and invites exploration of alternative bagging strategies and fusion mechanisms.

Abstract

In recent times, with the exception of sporadic cases, the trend in Computer Vision is to achieve minor improvements at the cost of considerable increases in complexity. To reverse this trend, we propose a novel method to boost image classification performance without increasing complexity. To this end, we revisited ensembling, a powerful approach that is often not used properly because of its more complex nature and its training time, and made it feasible through a specific design choice. First, we trained two EfficientNet-b0 end-to-end models (known to be the architecture with the best overall accuracy/complexity trade-off for image classification) on disjoint subsets of data (i.e. bagging). Then, we made an efficient adaptive ensemble by fine-tuning a trainable combination layer. In this way, we were able to outperform the state-of-the-art by an average of $0.5\%$ in accuracy, while reducing the number of parameters by 5-60 times and the floating-point operations (FLOPs) by 10-100 times on several major benchmark datasets.
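The disjoint-subset step mentioned in the abstract can be illustrated with a minimal sketch. The function name `disjoint_subsets`, the random shuffle, and the fixed seed are assumptions made for illustration; the abstract only states that the two weak learners are trained on disjoint subsets of the data.

```python
import random


def disjoint_subsets(indices, n_learners=2, seed=0):
    """Shuffle sample indices and split them into n_learners disjoint subsets.

    Hypothetical helper: the shuffle and the seed are illustrative choices,
    not details taken from the paper.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    # Striding deals samples round-robin, so subset sizes differ by at most one.
    return [shuffled[k::n_learners] for k in range(n_learners)]


# Ten sample indices split between two weak learners.
subsets = disjoint_subsets(range(10), n_learners=2)
```

Each weak learner would then be trained only on the samples indexed by its own subset, so no training sample is shared between the two models.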

Paper Structure (19 sections, 11 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Example of scaling types, from left to right: a baseline network example, conventional scaling methods that only increase one network dimension (width, depth, resolution) and, at the end, the EfficientNet compound scaling method that uniformly scales all three dimensions with a fixed ratio. Image taken from the original EfficientNet paper.
  • Figure 2: Ensemble by voting: the final output is obtained by picking the mode (i.e. most frequent class value) among the outputs produced by the weak learners. In this way, the weak learners are independent and voting is effective with a high number of heterogeneous weak learners.
  • Figure 3: Ensemble by output combination: an additional combination layer is fed with the outputs of the weak learners and combines them. In this way, the weak learners are no longer independent and the combination layer can be trained to better adapt to data.
  • Figure 4: Our adaptive ensemble method: an optimised version of the method shown in Figure 3, in which we avoid redundancy and reduce complexity by deleting the output module (dark grey-filled) of the weak learners and feeding the combination layer with the features. Light grey-filled modules denote modules whose parameters are frozen during training. The diagram depicts the case $N=2$, which is used in most of the experiments in this paper, but the method can be applied with an arbitrary value for $N$.
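
The difference between voting (Figure 2) and feature-level combination (Figure 4) can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the captions only say that the combination layer is trainable and is fed with the weak learners' features, so the concatenation of features, the single linear layer, and the names `vote` and `combine_features` are all hypothetical. The weights below are hand-picked toy values standing in for a trained combiner.

```python
from collections import Counter


def vote(predictions):
    """Figure 2 style: return the most frequent class label (the mode)."""
    return Counter(predictions).most_common(1)[0][0]


def combine_features(features_per_learner, weights, bias):
    """Figure 4 style (assumed form): concatenate the learners' feature
    vectors and apply one linear layer to obtain class scores."""
    concatenated = [x for feats in features_per_learner for x in feats]
    return [
        sum(w * x for w, x in zip(row, concatenated)) + b
        for row, b in zip(weights, bias)
    ]


# Voting over three independent weak learners.
voted_label = vote(["cat", "dog", "cat"])

# Toy combiner with N = 2 learners, 2 features each, and 2 classes.
features = [[1.0, 0.0], [0.0, 2.0]]   # outputs of the frozen feature extractors
weights = [[1.0, 0.0, 0.0, 0.0],      # hypothetical combiner weights
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
logits = combine_features(features, weights, bias)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
```

In the voting case each learner produces a final label on its own, whereas in the feature-level case only the combiner's parameters (`weights` and `bias` here) would be updated during fine-tuning while the feature extractors stay frozen.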