Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes

Akshar Gothi

TL;DR

This work provides a controlled, single-dataset comparison of CNNs and Vision Transformers on SpaceNet under two label-distribution regimes: an imbalanced five-class split and a balanced-resampled split with 700 images per class. With preprocessing, training budget, and evaluation protocol held fixed, EfficientNet-B0 achieves strong accuracy with lower latency on the imbalanced regime, while ViT-Base matches its accuracy at higher compute cost; under balanced data, EfficientNet-B0 reaches ~99% accuracy and ViT-Tiny ~98%, narrowing the architecture gap. The authors emphasize reproducibility through shared manifests and logs, and offer practical guidance: CNNs are advantageous for skewed data and latency-constrained deployments, whereas ViTs can be attractive when class balance and robustness are prioritized. Overall, the results align with broader findings that data scale and pretraining influence ViT performance, while compact CNNs remain efficient on modest datasets like SpaceNet.

Abstract

We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.
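
The reported classification metrics can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names and the two-class `toy_cm` matrix are hypothetical and chosen only to show why accuracy and balanced accuracy can diverge on a skewed split. It assumes rows index the true class and columns the predicted class, and it scores a class with no samples as 0, which may differ from library conventions.

```python
from typing import List

Matrix = List[List[int]]  # cm[i][j]: images of true class i predicted as class j


def accuracy(cm: Matrix) -> float:
    """Fraction of all images that fall on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_recall(cm: Matrix) -> List[float]:
    """Recall for each true class: correct predictions over row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def balanced_accuracy(cm: Matrix) -> float:
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)


def macro_f1(cm: Matrix) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical imbalanced two-class example (100 vs 10 images), not SpaceNet data.
toy_cm = [[90, 10],
          [5, 5]]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_cm))
    print("per-class recall:", per_class_recall(toy_cm))
    print("balanced accuracy:", balanced_accuracy(toy_cm))
    print("macro-F1:", macro_f1(toy_cm))
```

On the toy matrix, overall accuracy is dominated by the majority class, while balanced accuracy and macro-F1 weight the minority class equally. This is the reason for reporting all three when comparing an imbalanced and a balanced regime.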

Paper Structure

This paper contains 22 sections, 1 equation, 1 figure, 9 tables.

Figures (1)

  • Figure 1: Confusion matrices across regimes for EfficientNetB0 (CNN) and ViT-Base.