Self-Supervised Pretraining for Aerial Road Extraction

Rupert Polley, Sai Vignesh Abishek Deenadayalan, J. Marius Zöllner

TL;DR

The paper tackles the labeled-data bottleneck in aerial road segmentation with a three-step training workflow. The model first learns general image structure by inpainting unlabeled aerial images, then narrows the gap to road segmentation through a guided inpainting step that uses road masks, and is finally fine-tuned with segmentation labels. The approach is architecture-agnostic and improves results across multiple models (e.g., SPIN RoadMapper, EmekU-Net) and datasets (DeepGlobe, CITY-OSM), especially under limited labeled data and domain shift. Key contributions are the three-step training pipeline, a dynamic inpainting masking strategy, and empirical evidence of improved road IoU and domain robustness, with inference efficiency preserved. The work offers a scalable path toward high-quality HD-map generation from abundant unlabeled aerial imagery, reducing labeling costs while maintaining performance.

Abstract

Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
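
The inpainting objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mask_image` and `masked_l1_loss`, the choice of an L1 reconstruction error, and the restriction of the loss to hidden pixels are all assumptions made for illustration. The only element taken from the paper is that regions are hidden by an element-wise product with a binary mask.

```python
import numpy as np

def mask_image(image, keep_mask):
    """Hide pixels via an element-wise product.

    image:     float array of shape (H, W, C)
    keep_mask: binary array of shape (H, W); 1 = visible, 0 = hidden
    """
    return image * keep_mask[..., None]

def masked_l1_loss(reconstruction, target, keep_mask):
    """Mean absolute error over hidden pixels only (assumed loss choice)."""
    hidden = 1.0 - keep_mask
    n_hidden = hidden.sum() * target.shape[-1]
    if n_hidden == 0:
        return 0.0
    abs_err = np.abs(reconstruction - target) * hidden[..., None]
    return float(abs_err.sum() / n_hidden)

# Toy example: a random "aerial image" with one hidden 3x3 square.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
keep_mask = np.ones((8, 8))
keep_mask[2:5, 2:5] = 0.0

masked = mask_image(image, keep_mask)

# A perfect reconstruction gives zero loss; returning the masked input
# unchanged is penalized by the mean intensity of the hidden region.
loss_perfect = masked_l1_loss(image, image, keep_mask)
loss_trivial = masked_l1_loss(masked, image, keep_mask)
```

In a real training loop the reconstruction would come from the network being pretrained, and the same network weights would then be reused for the segmentation fine-tuning step.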

Paper Structure

This paper contains 18 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An overview of our proposed training method. In Step 1, a model is pretrained using inpainting on unlabeled aerial images. In Step 2, road segmentation labels are incorporated to guide the inpainting task specifically toward road structures. Finally, after pretraining on inpainting tasks, the model is fine-tuned on the road segmentation task. The symbol $\odot$ denotes the element-wise product with a binary mask.
  • Figure 2: As training progresses, the size of the clusters increases while their number decreases. Epochs from left to right are 10, 30, 50.
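
The trend in Figure 2 (fewer but larger clusters as training proceeds) could be produced by a schedule like the one sketched below. This is a hypothetical illustration, not the paper's masking procedure: the linear interpolation, the square cluster shape, the value ranges in `size_range` and `count_range`, and the 50-epoch horizon are all assumptions, as are the helper names `cluster_schedule` and `sample_keep_mask`.

```python
import numpy as np

def cluster_schedule(epoch, total_epochs, size_range=(4, 32), count_range=(40, 4)):
    """Assumed linear schedule: cluster side length grows, cluster count shrinks."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    size = int(round(size_range[0] + t * (size_range[1] - size_range[0])))
    count = int(round(count_range[0] + t * (count_range[1] - count_range[0])))
    return size, count

def sample_keep_mask(height, width, size, count, rng):
    """Binary mask: 0 inside `count` square clusters of side `size`, 1 elsewhere.

    Square clusters are an assumption; clusters may overlap.
    """
    keep = np.ones((height, width), dtype=np.float32)
    for _ in range(count):
        top = rng.integers(0, max(height - size, 0) + 1)
        left = rng.integers(0, max(width - size, 0) + 1)
        keep[top:top + size, left:left + size] = 0.0
    return keep

rng = np.random.default_rng(0)
early = cluster_schedule(0, 50)    # many small clusters at the start
late = cluster_schedule(49, 50)    # few large clusters at the end
mask = sample_keep_mask(128, 128, *late, rng)
```

The resulting `mask` can be multiplied element-wise with an image to produce the inpainting input. A curriculum of this kind makes the reconstruction task harder over time, since larger hidden regions provide less local context.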