Onboard Satellite Image Classification for Earth Observation: A Comparative Study of ViT Models

Thanh-Dung Le, Vu Nguyen Ha, Ti Ti Nguyen, Geoffrey Eappen, Prabhu Thiruvasagam, Hong-fu Chou, Duc-Dung Tran, Hung Nguyen-Kha, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas

TL;DR

This work tackles onboard land-use classification for Earth observation (EO) by comparing a wide range of models, from CNNs and ResNets to pretrained Vision Transformers (ViTs). Using EuroSAT and PatternNet, the authors compare models trained from scratch against pretrained ViTs, focusing on accuracy, computational efficiency, and robustness to noise, including end-to-end SatCom scenarios. The results show that pretrained ViTs outperform models trained from scratch, with EfficientViT-M2 offering the best balance of high accuracy, low computational cost (≈203.53 MFLOPs), and strong noise robustness, which makes it well suited to resource-constrained onboard processing. The findings give practical guidance for deploying robust, energy-efficient onboard EO inference and point to future work on multitask, multimodal, and diffusion-based transformers for spaceborne applications.

Abstract

This study aims to identify the most effective pre-trained model for land use classification in onboard satellite processing, with an emphasis on high accuracy, computational efficiency, and robustness to the noisy data conditions commonly encountered during satellite-based inference. Through extensive experimentation, we compare the performance of traditional CNN-based models, ResNet-based models, and various pre-trained Vision Transformer (ViT) models. Our findings demonstrate that pre-trained ViT models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in both accuracy and efficiency. These models achieve high performance with reduced computational requirements and are more resilient during inference under noisy conditions. While MobileViTV2 excels on clean validation data, EfficientViT-M2 proves more robust when handling noise, making it the most suitable model for onboard satellite EO tasks. Our experimental results demonstrate that EfficientViT-M2 is the optimal choice for reliable and efficient remote sensing image classification (RS-IC) in satellite operations, achieving 98.76% accuracy, precision, and recall. Specifically, EfficientViT-M2 delivers the highest performance across all metrics, excels in training efficiency (1,000 s) and inference time (10 s), and demonstrates greater robustness (overall robustness score of 0.79). In addition, EfficientViT-M2 consumes 63.93% less power than MobileViTV2 (79.23 W) and 73.26% less power than SwinTransformer (108.90 W), which highlights its significant advantage in energy efficiency.
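The power comparison in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that the wattages in parentheses (79.23 W and 108.90 W) are the baseline draw of MobileViTV2 and SwinTransformer, and that "X% less power" means a relative reduction from that baseline. The function names are illustrative and do not come from the paper.

```python
def implied_power(baseline_watts: float, percent_less: float) -> float:
    """Power draw implied by consuming `percent_less` % less than a baseline."""
    return baseline_watts * (1.0 - percent_less / 100.0)


def percent_reduction(baseline_watts: float, model_watts: float) -> float:
    """Relative saving (in %) of a model compared with a baseline."""
    return 100.0 * (baseline_watts - model_watts) / baseline_watts


# Assumption: the parenthesised values are the baselines' inference power.
from_mobilevit = implied_power(79.23, 63.93)   # relative to MobileViTV2
from_swin = implied_power(108.90, 73.26)       # relative to SwinTransformer

print(f"{from_mobilevit:.2f} W, {from_swin:.2f} W")  # prints "28.58 W, 29.12 W"
```

Under these assumptions, both comparisons put EfficientViT-M2 at roughly 29 W during inference. The small gap between the two implied values may simply reflect rounding in the reported percentages.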

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Different noise levels with Gaussian and motion blur.
  • Figure 2: Different augmentation techniques.
  • Figure 3: Statistical comparison for model performance.
  • Figure 4: Statistical comparison for power consumption during inference.
  • Figure 5: Confusion matrices for MobileViTV2 (top) and EfficientViT-M2 (bottom) on EuroSAT.
  • ...and 3 more figures
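
Figure 1 refers to test images degraded with Gaussian noise and motion blur at different levels. The following is a minimal sketch of how such perturbations could be generated for a robustness check. It is not the authors' pipeline: the function names, the horizontal blur direction, the edge handling, and the grayscale nested-list image format (values in [0, 1]) are all assumptions made for illustration.

```python
import random


def add_gaussian_noise(image, sigma, seed=None):
    """Add zero-mean Gaussian noise (std `sigma`) and clamp pixels to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def horizontal_motion_blur(image, kernel_size):
    """Average each pixel with its horizontal neighbours (box kernel).

    Pixels beyond the image border are replaced by the nearest edge pixel.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    blurred = []
    for row in image:
        width = len(row)
        out = []
        for x in range(width):
            window = [
                row[min(width - 1, max(0, x + d))]
                for d in range(-half, half + 1)
            ]
            out.append(sum(window) / kernel_size)
        blurred.append(out)
    return blurred


# A single bright pixel is smeared across three columns by a size-3 kernel.
impulse = [[0.0, 0.0, 1.0, 0.0, 0.0]]
smeared = horizontal_motion_blur(impulse, 3)

# Increasing sigma gives progressively noisier copies of the same image.
flat = [[0.5] * 4 for _ in range(4)]
noisy_levels = [add_gaussian_noise(flat, s, seed=0) for s in (0.0, 0.05, 0.1)]
```

A larger `sigma` or `kernel_size` corresponds to a stronger degradation level. A real evaluation would apply the same idea to full multi-channel image arrays before passing them to each model.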