Supervised domain adaptation for building extraction from off-nadir aerial images

Bipul Neupane, Jagannath Aryal, Abbas Rajabifard

TL;DR

This paper proposes supervised domain adaptation (SDA) of encoder-decoder networks (EDNs) between noisy and clean datasets to tackle the misalignment between building labels and off-nadir imagery. The method and the experimental findings will be beneficial in training robust CNNs for building extraction.

Abstract

Building extraction, needed for inventory management and planning of urban environments, is affected by the misalignment between labels and off-nadir source imagery in training data. Teacher-Student learning of noise-tolerant convolutional neural networks (CNNs) is the existing solution, but the Student networks typically have lower accuracy and cannot surpass the Teacher's performance. This paper proposes a supervised domain adaptation (SDA) of encoder-decoder networks (EDNs) between noisy and clean datasets to tackle the problem. EDNs are configured with high-performing lightweight encoders such as EfficientNet, ResNeSt, and MobileViT. The proposed method is compared against existing Teacher-Student learning methods, namely knowledge distillation (KD) and deep mutual learning (DML), on three newly developed datasets. The methods are evaluated for different urban buildings (low-rise, mid-rise, high-rise, and skyscrapers), where misalignment increases with building height and spatial resolution. For a robust experimental design, 43 lightweight CNNs, five optimisers, nine loss functions, and seven EDNs are benchmarked to obtain the best-performing EDN for SDA. The SDA of the best-performing EDN from our study significantly outperformed KD and DML, with F1 scores of up to 0.943, 0.868, 0.912, and 0.697 for low-rise, mid-rise, high-rise, and skyscraper buildings, respectively. The proposed method and the experimental findings will be beneficial in training robust CNNs for building extraction.
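
The core idea of SDA, as summarised in the abstract, is to train a network on one labelled dataset and then continue supervised training of all its parameters on another. The following is a minimal toy sketch of that two-stage pattern, not the authors' implementation. A 1-D linear model stands in for an EDN, and the data, learning rates, epoch counts, and the choice of pre-training on the noisy set before adapting to the clean one are all assumptions made for illustration.

```python
# Toy sketch only: a 1-D linear model (w, b) stands in for an encoder-decoder network.
# All data and hyperparameters below are hypothetical.

def mse(params, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, lr, epochs):
    """Full-batch gradient descent on MSE; every parameter is updated."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Larger source set whose labels carry a systematic offset (a stand-in for misalignment).
xs = [i / 10 for i in range(-20, 21)]
noisy_source = [(x, 2.0 * x + 1.0 + 0.5) for x in xs]

# Small target set with clean labels.
clean_target = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0)]

# Stage 1: pre-train from scratch on the noisy source data.
pretrained = train((0.0, 0.0), noisy_source, lr=0.1, epochs=500)

# Stage 2: adapt all parameters to the clean target data, starting from stage 1.
adapted = train(pretrained, clean_target, lr=0.05, epochs=200)

loss_before = mse(pretrained, clean_target)
loss_after = mse(adapted, clean_target)
print(f"clean-set MSE before adaptation: {loss_before:.4f}, after: {loss_after:.2e}")
```

In this toy setting the pre-trained model inherits the label offset, and the adaptation stage removes it. A real EDN would replace `train` with mini-batch optimisation of a segmentation loss.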

Paper Structure (17 sections, 1 equation, 10 figures, 5 tables)

This paper contains 17 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Misalignment of building labels with an increase in spatial resolution of aerial image and height of the building. The numbers in the labels show the height of the building and the length of the red line shows the displacement of labels. As an illustration, the building with a height of 63.88 m in 120 cm (left) spatial resolution image is displaced less in comparison with 60 cm (middle) and 30 cm (right) spatial resolution images. The images are collected from Nearmap Tile API over the City of Melbourne, Australia.
  • Figure 2: Study area and sample image-label pairs of the Teacher's dataset (T), Student's dataset (S), and Evaluation dataset (Ev). (a) City of Melbourne's Census of Land Use and Employment (CLUE) boundaries and CBD area. (b) Sample image-label pairs.
  • Figure 3: An illustrative example of the SDA with an EDN configured with different encoder CNN options. (a) SDA of a U-Net EDN with all layers of the pre-trained Teacher being adapted to the Student dataset. (b) Encoder option 1: the EfficientNetv2 CNN \cite{tan2021efficientnetv2} from Google Brain. (c) Encoder option N: the MobileViT CNN \cite{mehta2021mobilevit} from Apple.
  • Figure 4: Segmentation results of the selected lightweight Students on the subset of Massachusetts building dataset. U-EfficientNetv2B3 produced the highest F1 score of 0.965. Other Students produced lower scores but with trade-offs between network parameters, training time, loss, and evaluation scores as seen in Table \ref{tab:ednencoder}.
  • Figure 5: Teacher search with VGG-19 (left) and EfficientNetv2B3 (right) as CNNs for the state-of-the-art EDNs. The heat maps on top compare the evaluation scores of the EDNs. The parallel coordinate plots on the bottom show the trade-off between network parameters (Par.), loss, training time (ms/it), and F1 score. The network with the dashed line (U-Net) provides the best trade-off among the variables.
  • ...and 5 more figures
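
The trend described in the Figure 1 caption, where label displacement grows with building height and with finer spatial resolution, can be illustrated with a simple geometric sketch. It assumes flat terrain and parallel viewing rays, so that a roof at height h appears shifted horizontally by roughly h·tan(θ) for an off-nadir angle θ. The function name and the 10° angle are hypothetical and not taken from the paper. Only the 63.88 m height and the 120, 60, and 30 cm resolutions come from the caption.

```python
import math

def label_displacement_px(height_m, off_nadir_deg, gsd_cm):
    """Approximate roof-to-footprint offset in pixels.

    Assumes flat terrain and parallel viewing rays: the horizontal shift on
    the ground is height * tan(off-nadir angle), converted to pixels by
    dividing by the ground sampling distance (GSD).
    """
    displacement_m = height_m * math.tan(math.radians(off_nadir_deg))
    return displacement_m / (gsd_cm / 100.0)

# Building height from the Figure 1 caption; the viewing angle is an assumed value.
height_m = 63.88
assumed_off_nadir_deg = 10.0

offsets = {
    gsd_cm: label_displacement_px(height_m, assumed_off_nadir_deg, gsd_cm)
    for gsd_cm in (120, 60, 30)
}
for gsd_cm, px in offsets.items():
    print(f"{gsd_cm} cm/pixel: about {px:.1f} px of displacement")
```

Under these assumptions the ground displacement in metres is the same at every resolution, but halving the ground sampling distance doubles the offset measured in pixels. This is consistent with the caption's observation that the same building is displaced less in the 120 cm image than in the 60 cm and 30 cm images.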