When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery
Yiqun Xie, Zhihao Wang, Weiye Chen, Zhili Li, Xiaowei Jia, Yanhua Li, Ruichen Wang, Kangyang Chai, Ruohan Li, Sergii Skakun
TL;DR
The paper investigates when foundation models are effective for pixel-level classification on multispectral imagery. It compares three foundation models (Prithvi, SegFormer, and ViT variants) with traditional ML and regular DL approaches across crop, burn scar, and flood tasks, using standardized 100-epoch finetuning and metrics such as IoU and F1. The findings show foundation models do not consistently outperform traditional ML; they can excel when texture is informative but often do not beat regular DL, and their advantage over traditional methods is not robust across tasks. The authors argue that success hinges on aligning SSL objectives with downstream tasks, and that MAE-style pretraining may not align well with many remote-sensing problems, underscoring the need for task-specific SSL designs and continued evaluation of domain-focused models like Prithvi.
Abstract
Foundation models, i.e., very large deep learning models, have demonstrated impressive performance on various language and vision tasks that is difficult to reach with smaller models. The major success of GPT-type language models is particularly exciting and raises expectations about the potential of foundation models in other domains, including satellite remote sensing. In this context, great efforts have been made to build foundation models and test their capabilities in broader applications; examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and if not, when are they suitable? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still perform similarly to or better than foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and regular-size deep learning models is not obvious. The results are consistent with our analysis: The suitability of foundation models depends on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.
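As an illustration of how pixel-level classification results such as those summarized here are typically scored, the sketch below computes IoU and F1 for one class from flattened per-pixel labels. It is a minimal example, not the authors' evaluation code: the function name `iou_and_f1`, the toy labels, and the convention of returning 1.0 when the class is absent from both prediction and ground truth are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def iou_and_f1(pred: Sequence[int], truth: Sequence[int],
               positive: int = 1) -> Tuple[float, float]:
    """Return (IoU, F1) for one class over flattened per-pixel labels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of pixels")

    # Count true positives, false positives, and false negatives per pixel.
    tp = sum(1 for p, t in zip(pred, truth) if p == positive and t == positive)
    fp = sum(1 for p, t in zip(pred, truth) if p == positive and t != positive)
    fn = sum(1 for p, t in zip(pred, truth) if p != positive and t == positive)

    union = tp + fp + fn
    if union == 0:
        # Assumed convention: class absent in both maps counts as a perfect match.
        return 1.0, 1.0

    iou = tp / union                   # intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
    return iou, f1


# Hypothetical toy example: six pixels, 1 = target class (e.g., burn scar).
truth = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
iou, f1 = iou_and_f1(pred, truth)
print(f"IoU={iou:.3f}, F1={f1:.3f}")
```

For a single class the two metrics are monotonically related (F1 = 2·IoU / (1 + IoU)), so they rank models identically on one class but can diverge once averaged across classes or images.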
