A Billion-scale Foundation Model for Remote Sensing Images

Keumgang Cha; Junghoon Seo; Taekyung Lee

TL;DR

This work addresses a gap in remote sensing foundation models by examining how increasing the number of model parameters affects downstream tasks. The authors pretrain ViT backbones of up to 2.4B parameters using MAE on the MillionAID dataset and convert them to the ViTDET structure for rotated object detection and semantic segmentation. They report performance gains with larger parameter counts across DOTA v2.0, DIOR-R, Potsdam, and LoveDA, along with improved data efficiency. The results suggest that domain-specific, large-scale pretraining combined with parallelized transformer scaling can yield strong remote sensing foundation models for high-resolution geospatial analysis.

Abstract

As the potential of foundation models in visual tasks has garnered significant attention, pretraining these models before downstream tasks has become a crucial step. The three key factors in pretraining foundation models are the pretraining method, the size of the pretraining dataset, and the number of model parameters. Recently, research in the remote sensing field has focused primarily on the pretraining method and the size of the dataset, with limited emphasis on the number of model parameters. This paper addresses this gap by examining the effect of increasing the number of model parameters on the performance of foundation models in downstream tasks such as rotated object detection and semantic segmentation. We pretrained foundation models with varying numbers of parameters, including 86M, 605.26M, 1.3B, and 2.4B, to determine whether performance in downstream tasks improved with an increase in parameters. To the best of our knowledge, this is the first billion-scale foundation model in the remote sensing field. Furthermore, we propose an effective method for scaling up and fine-tuning a vision transformer in the remote sensing field. To evaluate general performance in downstream tasks, we employed the DOTA v2.0 and DIOR-R benchmark datasets for rotated object detection, and the Potsdam and LoveDA datasets for semantic segmentation. Experimental results demonstrated that, across all benchmark datasets and downstream tasks, both the performance and the data efficiency of the foundation models improved as the number of parameters increased. Moreover, our models achieve state-of-the-art performance on several datasets, including DIOR-R, Potsdam, and LoveDA.

Paper Structure (21 sections, 3 equations, 6 figures, 11 tables)

Introduction
Related Works
Self-supervised Learning in Computer Vision
Foundation Models in Remote Sensing
Methods
Self Supervised Learning by MAE
MillionAID
MAE
Scaling Up Vision Transformer
Implementation Detail for Pretraining
Fine Tuning Vision Transformer for Object Localization
Experimental Results
Rotated Object Detection
Dataset
Implementation Details and Experiment Settings
...and 6 more sections

Figures (6)

Figure 1: This figure shows the variation in the size of foundation models over the years, with blue and red representing the number of parameters in computer vision and remote sensing models, respectively. While billion-scale foundation models are already being studied in computer vision, none has yet been developed in remote sensing. More detailed information can be found in Table \ref{tab:params_vs_model}. Models with fewer than 1 billion parameters are omitted.
Figure 2: A brief overview of self-supervised learning methods in computer vision: contrastive learning, self-distillation, and masked image modeling. (a) In contrastive learning, positive pairs of data are brought closer together while negative pairs are pushed further apart. (b) Self-distillation trains a model to predict the relationships between multiple views of an unlabeled image. (c) Masked image modeling masks a portion of an image and then trains the model to reconstruct the masked portion.
Figure 3: This figure explains how to effectively increase the number of parameters of the vision transformer. The two models shown have substantially the same amount of computation and number of parameters. In natural language processing, multi-head self-attention and feed-forward blocks are typically stacked only serially, but performance differs when they are configured in parallel. If the product of the number of layers and the parallelism is the same, as with 12 layers at parallelism 1 and 6 layers at parallelism 2, the backbone has the same number of parameters and the same FLOPs.
Figure 4: This figure shows the overall flow for pretraining and downstream tasks. The plain vision transformer is pretrained by MAE on the remote sensing imagery dataset MillionAID. The plain vision transformer is then converted to the ViTDET structure, with local and global attention, for the downstream tasks of rotated object detection and semantic segmentation. To upsample and downsample features, scale blocks are adopted after the ViTDET backbone. Scale block 1 consists of a serially connected transposed convolution, normalization, GELU, and transposed convolution. Scale block 2 is a single transposed convolution. Scale block 3 is an identity block. Scale block 4 is max pooling with kernel size 2. All transposed convolutions in the scale blocks use kernel size 2 and stride 2.
Figure 5: Visualization results of the proposed model. The first through third rows are results on the DOTA v2.0 dataset. Since the test labels of DOTA v2.0 are unavailable, the images from left to right are ViT-B12$\times$1, ViT-L12$\times$4, ViT-H12$\times$4, and ViT-G12$\times$4. The fourth through sixth rows are results on the DIOR-R dataset. The images from left to right are the ground-truth label, ViT-B12$\times$1, ViT-L12$\times$4, ViT-H12$\times$4, and ViT-G12$\times$4.
...and 1 more figures
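
The Figure 3 caption states that a backbone keeps the same parameter count whenever the product of layers and parallelism is unchanged. The following is a minimal sketch of that arithmetic, not the authors' implementation. It assumes a standard ViT block (fused query/key/value projection, an MLP with expansion ratio 4, biases everywhere), one LayerNorm per branch, and an embedding width of 768 for the base model. The function names `block_params` and `backbone_params` are hypothetical, and patch embedding, position embedding, and task heads are left out.

```python
def block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Parameters of one attention branch plus one feed-forward branch."""
    attn = 3 * dim * dim + 3 * dim      # fused query/key/value projection
    attn += dim * dim + dim             # attention output projection
    hidden = mlp_ratio * dim            # assumed MLP expansion ratio of 4
    ffn = dim * hidden + hidden         # first linear layer
    ffn += hidden * dim + dim           # second linear layer
    norms = 2 * (2 * dim)               # assumed: one LayerNorm (scale + shift) per branch
    return attn + ffn + norms


def backbone_params(dim: int, layers: int, parallelism: int) -> int:
    """Transformer-block parameters only; embeddings and heads are excluded."""
    return layers * parallelism * block_params(dim)


# 12 layers x parallelism 1 versus 6 layers x parallelism 2, width 768 (assumed).
serial = backbone_params(dim=768, layers=12, parallelism=1)
parallel = backbone_params(dim=768, layers=6, parallelism=2)
print(serial, parallel, serial == parallel)  # 85054464 85054464 True
```

Under these assumptions both layouts hold about 85M block parameters, which is in line with the 86M quoted in the abstract for the smallest model once embeddings are added. The equality depends on each parallel branch carrying its own normalization; a design that shared one LayerNorm across branches would differ by a few thousand parameters.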
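
The four scale blocks in the Figure 4 caption turn a single-scale backbone feature map into a multi-scale pyramid. The sketch below only tracks spatial sizes, using the standard output-size formulas for an unpadded transposed convolution and unpadded max pooling. The 1024-pixel input, the patch size of 16, and the pooling stride of 2 are assumptions for illustration (the caption gives only the pooling kernel size), and the helper names are hypothetical. Normalization and GELU are omitted because they do not change the spatial size.

```python
def transposed_conv_size(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output length of a transposed convolution without padding."""
    return (size - 1) * stride + kernel


def max_pool_size(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output length of max pooling without padding (stride 2 is assumed)."""
    return (size - kernel) // stride + 1


def scale_block_sizes(feature_size: int) -> dict:
    """Spatial size of each scale block's output for a square feature map."""
    return {
        # Block 1: two transposed convolutions in series (4x upsampling).
        "scale block 1": transposed_conv_size(transposed_conv_size(feature_size)),
        # Block 2: one transposed convolution (2x upsampling).
        "scale block 2": transposed_conv_size(feature_size),
        # Block 3: identity.
        "scale block 3": feature_size,
        # Block 4: max pooling with kernel size 2 (2x downsampling).
        "scale block 4": max_pool_size(feature_size),
    }


# Assumed setup: 1024 x 1024 input, patch size 16 -> 64 x 64 backbone feature map.
sizes = scale_block_sizes(1024 // 16)
print(sizes)
```

With these assumed settings the four outputs are 256, 128, 64, and 32 pixels on a side, that is, strides of 4, 8, 16, and 32 relative to the input image.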
