Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision

Pranav Jeevan, Amit Sethi

TL;DR

This work addresses how to choose resource-efficient backbones for domain-specific image classification in data-limited contexts. It systematically benchmarks 11 lightweight CNN backbones (and WaveMix) from Torchvision across 20 datasets spanning natural, texture, remote sensing, plant, astronomy, and medical domains using a consistent fine-tuning setup. The findings show ConvNeXt-Tiny as a strong default for natural images, with EfficientNetV2-S and RegNetY-3.2GF offering strong cross-domain performance, and WaveMix excelling in multi-resolution or low-resolution tasks; transformers generally underperform in low-data regimes. The study provides practical guidance for practitioners and releases code to enable reproducibility, albeit with limitations such as restricted model sizes and focus on image classification.

Abstract

In contemporary computer vision applications, particularly image classification, architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors. Despite the widespread use of these pre-trained convolutional neural networks (CNNs), there remains a gap in understanding the performance of various resource-efficient backbones across diverse domains and dataset sizes. Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings across a variety of datasets, including natural images, medical images, galaxy images, and remote sensing images. This comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem, especially in scenarios involving small datasets where fine-tuning a pre-trained network is crucial. Even though attention-based architectures are gaining popularity, we observed that they tend to perform poorly on low-data fine-tuning tasks compared to CNNs. We also observed that some CNN architectures, such as ConvNeXt, RegNet, and EfficientNet, consistently perform well compared to others across a diverse set of domains. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones, facilitating informed decision-making in model selection for a broad spectrum of computer vision domains. Our code is available here: https://github.com/pranavphoenix/Backbones

Paper Structure (18 sections, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Accuracy of the top 3 backbones as the amount of training data increases. Performance is reported at three scales of training data: 1% (500 images), 10% (5,000 images), and 100% (50,000 images).
  • Figure 2: Accuracy of the top 3 backbones as the amount of training data increases. Performance is reported at three scales of training data: 1% (1,000 images), 10% (10,000 images), and 100% (100,000 images).
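
The 1%, 10%, and 100% training-data settings in the captions above can be emulated by subsampling a labelled training set. The following is a minimal sketch only: it assumes class-balanced (stratified) sampling with a fixed seed, which the captions do not specify, and the function name `stratified_subset` and the toy label list are hypothetical, not taken from the authors' code.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a class-balanced subset covering `fraction` of the data.

    Assumption: sampling is stratified per class; the source does not say
    how its reduced training sets were drawn.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one example per class so no class disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy stand-in for a 50,000-image, 10-class training set (hypothetical data).
labels = [i % 10 for i in range(50_000)]
for fraction in (0.01, 0.10, 1.0):
    subset = stratified_subset(labels, fraction)
    print(f"{fraction:.0%}: {len(subset)} images")
```

On this toy label list the three fractions select 500, 5,000, and 50,000 indices, the same counts quoted in the Figure 1 caption. The returned indices could then be passed to whatever dataset wrapper a training script uses to restrict it to the subset.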