S3Former: Self-supervised High-resolution Transformer for Solar PV Profiling
Minh Tran, Adrian De Luis, Haitao Liao, Ying Huang, Roy McCann, Alan Mantooth, Jack Cothren, Ngan Le
TL;DR
This work tackles the challenge of accurately mapping solar PV installations from aerial imagery for grid-impact analysis. It introduces S3Former, an end-to-end Transformer-based segmentation model that combines a Masked Attention Mask Transformer with a self-supervised pretrained backbone to locate and segment solar panels robustly across varying ground sampling distances (GSD) and weather conditions. Training proceeds in two stages: a self-supervised pretext task followed by a supervised downstream segmentation task. SSL pretraining uses a teacher–student setup with an exponential moving average (EMA) teacher to learn invariant aerial features; downstream training uses a deformable multi-scale Transformer encoder and per-pixel embeddings, from which a learned query mechanism produces the final instance masks. On three public RGB datasets with different resolutions, S3Former consistently matches or surpasses state-of-the-art PV profiling methods and conventional deep learning segmentation models, with the largest gains on small or densely packed PV installations, underscoring its practical value for PV profiling and grid planning.
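The teacher–student EMA setup can be illustrated with a minimal NumPy sketch. It assumes a DINO-style scheme: the student would be trained by gradient descent (not shown) to match the teacher's sharpened output distribution, while the teacher receives no gradients and instead tracks an exponential moving average of the student's weights. The function names, the momentum of 0.996, and the temperatures 0.1 and 0.04 are illustrative assumptions rather than S3Former's actual settings; DINO's teacher-output centering and multi-crop augmentation are omitted.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis after temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(p_teacher, p_student), averaged over the batch.

    The teacher uses a lower temperature (sharper targets) and its outputs are
    treated as constants, so no gradient would flow into the teacher.
    The temperature defaults are hypothetical.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


def ema_update(teacher, student, momentum=0.996):
    """In-place EMA update: theta_t <- m * theta_t + (1 - m) * theta_s.

    teacher, student: dicts mapping parameter names to float NumPy arrays of
    matching shapes. The momentum value is a hypothetical choice.
    """
    for name, s_param in student.items():
        t_param = teacher[name]  # same array object, so updates are in place
        t_param *= momentum
        t_param += (1.0 - momentum) * s_param
    return teacher


# Toy usage: logits for two augmented views of the same batch.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 16))  # (batch, output dimensions)
teacher_out = rng.normal(size=(4, 16))
loss = distillation_loss(student_out, teacher_out)

student_params = {"w": np.ones((2, 2))}
teacher_params = {"w": np.zeros((2, 2))}
ema_update(teacher_params, student_params, momentum=0.9)
print(np.allclose(teacher_params["w"], 0.1))  # True
```

A high momentum makes the teacher a slowly changing, smoothed copy of the student, which gives the student stable targets during pretraining.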
Abstract
As the impacts of climate change escalate, the global need to transition to sustainable energy sources becomes increasingly evident. Renewable energy has emerged as a viable solution for users, with photovoltaic (PV) energy being a favored choice for small installations due to its reliability and efficiency. Accurate mapping of PV installations is crucial for understanding the extent of PV adoption and for informing energy policy. To meet this need, we introduce S3Former, designed to segment solar panels from aerial imagery and provide the size and location information needed to analyze the impact of these installations on the grid. Solar panel identification is challenging due to varying weather conditions, roof characteristics, variations in Ground Sampling Distance (GSD), and the lack of appropriate initialization weights for optimized training. To address these challenges, S3Former features a Masked Attention Mask Transformer built on a backbone pretrained with self-supervised learning. Specifically, our model leverages low-level and high-level features extracted from the backbone and incorporates an instance query mechanism into the Transformer architecture to improve the localization of solar PV installations. We introduce a self-supervised learning phase (pretext task) to improve the initialization weights of the S3Former backbone. We evaluate S3Former on diverse datasets and demonstrate improvements over state-of-the-art models.
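The masked attention at the core of a Masked Attention Mask Transformer (Mask2Former) restricts each instance query's cross-attention to the foreground region that the previous decoder layer predicted for that query. The following single-head NumPy sketch assumes the Mask2Former conventions of binarizing mask probabilities at 0.5 and letting a query with an empty mask attend to all pixels; the function name and tensor shapes are illustrative, and multi-head projections, the multi-scale feature pyramid, and the deformable encoder are omitted.

```python
import numpy as np


def masked_attention(queries, keys, values, mask_probs, threshold=0.5):
    """Single-head masked cross-attention from instance queries to pixels.

    queries:    (N, d) instance query embeddings
    keys:       (P, d) flattened per-pixel key features (P = H * W)
    values:     (P, d) flattened per-pixel value features
    mask_probs: (N, P) mask probabilities from the previous decoder layer
    Returns updated query features (N, d) and attention weights (N, P).
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)  # (N, P) scaled dot products
    allowed = mask_probs >= threshold       # foreground pixels per query
    # A query whose predicted mask is empty falls back to global attention.
    empty = ~allowed.any(axis=-1, keepdims=True)
    allowed = allowed | empty
    logits = np.where(allowed, logits, -np.inf)  # block background pixels
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)  # exp(-inf) = 0 outside the mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


# Toy example: 3 instance queries attending over a flattened 4x4 feature map.
rng = np.random.default_rng(0)
N, P, d = 3, 16, 8
q = rng.normal(size=(N, d))
k = rng.normal(size=(P, d))
v = rng.normal(size=(P, d))
m = np.zeros((N, P))
m[0, :4] = 0.9  # query 0: mask covers the first 4 pixels
m[1, 8:] = 0.7  # query 1: mask covers the last 8 pixels
# query 2 keeps an empty mask, so it attends to every pixel
out, w = masked_attention(q, k, v, m)
print(out.shape)                   # (3, 8)
print(np.allclose(w[0, 4:], 0.0))  # True
```

As motivated in Mask2Former, restricting attention to the predicted foreground lets each query refine features from its own region rather than from the whole image.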
