
S3Former: Self-supervised High-resolution Transformer for Solar PV Profiling

Minh Tran, Adrian De Luis, Haitao Liao, Ying Huang, Roy McCann, Alan Mantooth, Jack Cothren, Ngan Le

TL;DR

This work tackles the challenge of accurately mapping solar PV installations from aerial imagery for grid-impact analysis. It introduces S3Former, an end-to-end Transformer-based segmentation model that uses a Masked Attention Mask Transformer and a self-supervised pretrained backbone to robustly locate and segment solar panels across varied ground sampling distance (GSD) and weather conditions. A two-stage training pipeline combines a self-supervised pretext task with a downstream supervised segmentation task: self-supervised learning (SSL) pretraining uses a teacher–student exponential moving average (EMA) setup to learn invariant aerial features, while downstream training leverages a deformable multi-scale Transformer encoder and per-pixel embeddings to produce final instance masks via a learned query mechanism. Evaluated on three public RGB datasets with differing resolutions, S3Former consistently matches or surpasses state-of-the-art PV profiling methods and conventional deep learning segmentation models, with pronounced gains for small or densely packed PV installations, underscoring its practical value for PV profiling and grid planning.
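To make the teacher–student EMA idea concrete, here is a minimal sketch of the weight update such setups typically use: the teacher is not trained by gradient descent but tracks a moving average of the student. The function name `ema_update`, the dictionary-of-floats weight representation, and the momentum values are illustrative assumptions, not details taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an exponential moving average.

    teacher, student: dicts mapping parameter names to floats (a stand-in
    for real tensors; this representation is an assumption of the sketch).
    momentum: fraction of the old teacher kept at each step (assumed value).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy run: the teacher drifts toward the (here fixed) student weights.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a momentum close to 1, the teacher changes slowly and provides a stable target for the student's augmented views; the low momentum in the toy run is only there to make the drift visible in a few steps.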

Abstract

As the impact of climate change escalates, the global need to transition to sustainable energy sources becomes increasingly evident. Renewable energies have emerged as a viable solution for users, with photovoltaic (PV) energy being a favored choice for small installations due to its reliability and efficiency. Accurate mapping of PV installations is crucial for understanding the extent of their adoption and informing energy policy. To meet this need, we introduce S3Former, designed to segment solar panels from aerial imagery and provide size and location information critical for analyzing the impact of such installations on the grid. Solar panel identification is challenging due to factors such as varying weather conditions, roof characteristics, Ground Sampling Distance variations, and the lack of appropriate initialization weights for optimized training. To tackle these complexities, S3Former features a Masked Attention Mask Transformer incorporating a self-supervised learning pretrained backbone. Specifically, our model leverages low-level and high-level features extracted from the backbone and incorporates an instance query mechanism built on the Transformer architecture to enhance the localization of solar PV installations. We introduce a self-supervised learning phase (pretext task) to improve the initialization weights of the S3Former backbone. We evaluated S3Former on diverse datasets and demonstrate improvement over state-of-the-art models.

Paper Structure (16 sections, 10 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Examples of challenging characteristics of solar PV segmentation on (a) IGN France, (b) GGE France, (c) USGS California. Within a class, there is a large diversity in appearance: intra-class heterogeneity (red). Some different classes share a similar appearance: inter-class homogeneity (green). Solar PV installations can be so dense and small that they are hardly identifiable (blue).
  • Figure 2: Examples of six augmentations for the self-supervised pretext task. (a) Original image. (b) Top row: color jitter; bottom row, from left to right: random cropping, Gaussian noise, horizontal flip.
  • Figure 3: Overall pipeline of the proposed S3Former. The training pipeline includes two phases: Pretext task and Downstream task. The goal of the pretext task is to learn the optimal parameters such that the backbone network can extract similar representations from both $\mathbf{V}_s$ and $\mathbf{V}_t$, regardless of non-semantic factors introduced by augmentation. In the downstream task, the Pixel Decoder learns the correlations within multi-scale feature maps and utilizes them to decode enriched feature maps. Following this, the Masked Attention Mask Transformer integrates these decoded feature maps with $N$ learnable queries, employing a masked attention mechanism to accurately predict the segmentation mask.
  • Figure 4: Qualitative comparison on (a) IGN France, (b) GGE France, (c) USGS California. From top to bottom: (i) original RGB image, (ii) ground truth, (iii) UPerNet (xiao2018unified), (iv) DeepLabv3+ (chen2017rethinking), and (v) S3Former.
  • Figure 5: Extended qualitative comparison on different datasets: (a) IGN France, (b) GGE France, (c) USGS California. From left to right: RGB image, ground truth, w/o pretext, and w/ pretext. Special cases were selected to highlight the strengths of both models and the improvements S3Former gains from pretext-task pretraining: inter-class homogeneity (green), intra-class heterogeneity (red), and small-object identification (blue). Cases of missing annotations in the data are highlighted (purple).
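
The masked attention mechanism mentioned in the Figure 3 caption can be illustrated with a small sketch: a query attends only to positions inside its current foreground mask, which is done by setting the scores of all other positions to negative infinity before the softmax. The code below is a single-query, pure-Python illustration under stated assumptions. The function name `masked_attention`, the `1/sqrt(d)` scaling, and the fallback to full attention when the mask is empty are choices made for this sketch, not details confirmed by the source.

```python
import math


def masked_attention(query, keys, values, mask):
    """Single-query attention restricted to positions where mask[i] is True.

    query: list of d floats; keys: list of n lists of d floats;
    values: list of n lists of floats; mask: list of n booleans.
    Returns (output_vector, attention_weights).
    """
    scale = 1.0 / math.sqrt(len(query))  # assumed scaled dot-product form
    # Assumption of this sketch: an empty mask falls back to full attention
    # so that the softmax below stays well defined.
    if not any(mask):
        mask = [True] * len(keys)
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else float("-inf")
        for key, keep in zip(keys, mask)
    ]
    # Numerically stable softmax; masked positions get weight exactly 0.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy example: position 1 lies outside the mask and is ignored.
out, attn = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0], [2.0], [3.0]],
    mask=[True, False, True],
)
```

In a full model, each of the $N$ learnable queries would have its own mask, taken from the mask predicted at the previous decoder layer, so that every query concentrates on the region it is already tentatively responsible for.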