
Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations

Anuja Vats, David Völgyes, Martijn Vermeer, Marius Pedersen, Kiran Raja, Daniele S. M. Fantin, Jacob Alexander Hay

TL;DR

This paper addresses building footprint extraction from LiDAR under limited annotations and domain shift by introducing terrain-aware self-supervised learning. The model learns domain-relevant features by reconstructing terrain (DTM) from surface elevations (DSM) using a U-Net with a ResNet-50 encoder, augmented with SE attention and a dual loss regime (smooth-$L_1$ and LPIPS). The pretrained features are then fine-tuned for downstream segmentation, achieving strong performance at very low label fractions (e.g., $1\%$ of the labels, or 25 labeled examples) and generalizing robustly under distribution shift (T2/MapAI). The approach also compares favorably against larger architectures. The results indicate that terrain-informed SSL on LiDAR alone can rival or exceed ImageNet pretraining, enabling efficient, scalable building footprint extraction for remote sensing applications.

Abstract

Estimating building footprint maps from geospatial data is of paramount importance in urban planning, development, disaster management, and various other applications. Deep learning methodologies have gained prominence in producing building segmentation maps, offering the promise of precise footprint extraction without extensive post-processing. However, these methods face challenges in generalization and label efficiency, particularly in remote sensing, where obtaining accurate labels can be both expensive and time-consuming. To address these challenges, we propose terrain-aware self-supervised learning, tailored to remote sensing, using digital elevation models from LiDAR data. We propose to learn a model that differentiates between bare Earth and superimposed structures, enabling the network to implicitly learn domain-relevant features without the need for extensive pixel-level annotations. We test the effectiveness of our approach by evaluating building segmentation performance on test datasets with varying label fractions. Remarkably, with only 1% of the labels (equivalent to 25 labeled examples), our method improves over ImageNet pretraining, showing the advantage of leveraging unlabeled data for feature extraction in the domain of remote sensing. The improvement is most pronounced in few-shot scenarios, and the gap to ImageNet pretraining gradually narrows as the label fraction increases. We test on a dataset characterized by substantial distribution shifts and labeling errors to demonstrate the generalizability of our approach. When compared to other baselines, including ImageNet pretraining and more complex architectures, our approach consistently performs better, demonstrating the efficiency and effectiveness of self-supervised terrain-aware feature learning.
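
To make the pretext task more concrete, the following is a minimal sketch of a smooth-$L_1$ reconstruction loss between a predicted and a reference DTM. It uses the common definition of smooth-$L_1$ with a threshold `beta`; the function name `smooth_l1`, the value `beta=1.0`, and the elevation values are illustrative assumptions, not details taken from the paper. The LPIPS term is omitted because it requires a pretrained perceptual network.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-L1 loss over two equal-length sequences of elevations.

    Quadratic for absolute errors below `beta`, linear above it, so large
    outliers are penalized less harshly than with a squared error.
    `beta=1.0` is an assumed default, not a value reported in the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


# Hypothetical flattened elevation values in metres (illustration only).
dtm_reference = [10.0, 10.0, 10.0, 10.0]
dtm_predicted = [10.0, 10.5, 10.0, 13.0]

loss = smooth_l1(dtm_predicted, dtm_reference)
print(f"smooth-L1 reconstruction loss: {loss:.5f}")
```

In a real pipeline the same computation would run over full raster tensors in an array or deep learning library rather than Python lists.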

Paper Structure (11 sections, 5 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Different digital elevation models characterizing different aspects of the Earth's surface derived from LiDAR point clouds.
  • Figure 2: Scale differences in our datasets. The top row shows LiDAR images at different scales from the train (left) and test (right) datasets; the corresponding building labels are shown in the bottom row. Although our approach is pretrained for terrain recovery on images like those on the left, it transfers readily to buildings with the scales and sizes seen on the right, indicating generalizability to objects at different scales. (The magnification enables visual comparison between the two scales and is not drawn to scale.)
  • Figure 3: Our approach: We employ a U-Net with a ResNet-50 encoder for an image reconstruction task from the LiDAR surface image to the terrain image. Because the pretraining is formulated as a reconstruction task, all layers within the encoder and decoder (except for the last task-specific layer) can be reused in the downstream tasks as shown. Conv stands for convolutional operation, BN stands for batch normalization.
  • Figure 4: Training dataset (D1) and testing datasets (T1 and T2): During pretraining the model learns to reconstruct the DTM from the DSM at 1 m ground resolution. Both testing datasets, T1 and T2, take nDSM images as input for building segmentation, introducing a shift from the training data. T2 additionally introduces a resolution shift and label noise for testing the generalizability of our approach. Some of the labeling errors can be seen in the figure, where "added building" refers to buildings that are found in the label but are absent in the LiDAR images.
  • Figure 5: Predicted segmentation masks from our approach on the T2 (MapAI) dataset.
  • ...and 3 more figures
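
The captions above refer to three elevation products: DSM, DTM, and nDSM. As a minimal sketch of how they relate, the standard definition of the normalized DSM is the per-cell difference nDSM = DSM − DTM, i.e., the height of objects above bare ground. The function name `normalized_dsm`, the `clip_negative` option, and the grid values are illustrative assumptions and are not taken from the paper.

```python
def normalized_dsm(dsm, dtm, clip_negative=True):
    """Return the per-cell nDSM = DSM - DTM for two same-shaped 2D grids.

    Grids are given as lists of rows of elevations in metres.
    Clipping negative heights to zero is an assumed convention for
    handling small inconsistencies between the two rasters.
    """
    if len(dsm) != len(dtm) or any(len(a) != len(b) for a, b in zip(dsm, dtm)):
        raise ValueError("dsm and dtm must have the same shape")
    ndsm = []
    for surface_row, terrain_row in zip(dsm, dtm):
        row = []
        for surface, terrain in zip(surface_row, terrain_row):
            height = surface - terrain
            row.append(max(height, 0.0) if clip_negative else height)
        ndsm.append(row)
    return ndsm


# Hypothetical 2x2 tiles: one cell contains a 6 m tall structure.
dsm = [[12.0, 18.0], [12.5, 12.0]]
dtm = [[12.0, 12.0], [12.5, 12.5]]

ndsm = normalized_dsm(dsm, dtm)
print(ndsm)
```

With these definitions, a model that recovers the DTM from the DSM implicitly has to identify which surface cells belong to above-ground structures, which is the intuition behind using terrain reconstruction as a pretext task for building segmentation.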