Self-Supervised Pretraining for Aerial Road Extraction
Rupert Polley, Sai Vignesh Abishek Deenadayalan, J. Marius Zöllner
TL;DR
The paper tackles the labeled-data bottleneck in aerial road segmentation with a three-step self-supervised training workflow. The model first learns image structure by inpainting unlabeled aerial images, then narrows the gap to road segmentation through a guided inpainting step that uses road masks, and is finally fine-tuned with segmentation labels. The approach is architecture-agnostic and yields consistent improvements across multiple models (e.g., SPIN RoadMapper, EmekU-Net) and datasets (DeepGlobe, CITY-OSM), especially when labeled data is limited or the domain shifts. Key contributions are the three-step training pipeline, a dynamic inpainting masking strategy, and empirical evidence of improved road IoU and domain robustness, with inference efficiency preserved. The work offers a scalable path toward high-quality HD-map generation from abundant unlabeled aerial imagery, reducing labeling costs while maintaining performance.
Abstract
Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
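To make the inpainting objective concrete, the following is a minimal sketch of how an input could be masked and how a reconstruction loss could be restricted to the hidden pixels. It is not the authors' implementation: the square-block masking, the L1 loss, the zero fill value, and all function names (`random_block_mask`, `apply_mask`, `masked_l1_loss`) are assumptions chosen for illustration, and the paper's dynamic masking strategy may differ.

```python
import numpy as np

def random_block_mask(height, width, num_blocks, block_size, rng):
    """Return a boolean (H, W) mask, True where pixels are hidden.

    Assumption: square blocks at uniformly random positions; blocks may overlap.
    """
    mask = np.zeros((height, width), dtype=bool)
    for _ in range(num_blocks):
        top = rng.integers(0, height - block_size + 1)
        left = rng.integers(0, width - block_size + 1)
        mask[top:top + block_size, left:left + block_size] = True
    return mask

def apply_mask(image, mask, fill_value=0.0):
    """Hide the masked pixels of an (H, W, C) image; the input is not modified."""
    masked = image.copy()
    masked[mask] = fill_value
    return masked

def masked_l1_loss(prediction, target, mask):
    """Mean absolute error over the hidden pixels only."""
    if not mask.any():
        return 0.0
    return float(np.abs(prediction[mask] - target[mask]).mean())

# Toy example: a random array stands in for an unlabeled aerial tile.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
mask = random_block_mask(64, 64, num_blocks=4, block_size=16, rng=rng)
model_input = apply_mask(image, mask)

# Placeholder "model" that returns its input unchanged, so the loss
# measures how much information the mask removed.
loss = masked_l1_loss(model_input, image, mask)
```

In an actual pretraining loop, a segmentation backbone would take the place of the placeholder model, and the same weights would later be fine-tuned on road labels.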
