Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images and Noisy OSM Training Labels

Bharath Comandur, Avinash C. Kak

TL;DR

This work presents a novel multi-view training framework and convolutional neural network architecture for combining information from multiple overlapping satellite images and noisy training labels derived from OpenStreetMap to semantically label buildings and roads across large geographic regions.

Abstract

We present a novel multi-view training framework and CNN architecture for combining information from multiple overlapping satellite images and noisy training labels derived from OpenStreetMap (OSM) to semantically label buildings and roads across large geographic regions (100 km$^2$). Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches that use the views independently of one another. A unique (and, perhaps, surprising) property of our system is that modifications that are added to the tail-end of the CNN for learning from the multi-view data can be discarded at the time of inference with a relatively small penalty in the overall performance. This implies that the benefits of training using multiple views are absorbed by all the layers of the network. Additionally, our approach only adds a small overhead in terms of the GPU-memory consumption even when training with as many as 32 views per scene. The system we present is end-to-end automated, which facilitates comparing the classifiers trained directly on true orthophotos vis-a-vis first training them on the off-nadir images and subsequently translating the predicted labels to geographical coordinates. With no human supervision, our IoU scores for the buildings and roads classes are 0.8 and 0.64, respectively, which are better than those of state-of-the-art approaches that use OSM labels and that are not completely automated.
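
Since the results above are reported as per-class IoU (intersection over union), the following is a minimal sketch of how that metric is commonly computed from a predicted label map and a reference label map. The function name `per_class_iou`, the class ids, and the tiny label maps are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
def per_class_iou(pred, truth, num_classes):
    """Return a list with the IoU of each class id in [0, num_classes).

    pred and truth are flat sequences of integer class ids, one per pixel.
    A class that appears in neither sequence gets float('nan').
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        # Pixels where at least one map says class c.
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        ious.append(inter / union if union > 0 else float("nan"))
    return ious


# Hypothetical flattened 2x3 label maps: 0 = background, 1 = building, 2 = road.
truth = [1, 1, 0, 2, 2, 0]
pred = [1, 0, 0, 2, 2, 2]
ious = per_class_iou(pred, truth, num_classes=3)
```

In this toy example the building class has one agreeing pixel out of two pixels labeled building in either map, so its IoU is 0.5.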

Paper Structure

This paper contains 24 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: To illustrate the power of our approach, the bottom row shows buildings extracted using multi-view training for semantic labeling. Compare with the top row, where the training is based on single views. Building points are marked in translucent blue.
  • Figure 2: We have uploaded as Supporting Material the flyby videos and the images of the DSMs for two large areas, a 120 km$^2$ area from Ohio and a 62 km$^2$ area from California. The flyby videos can also be viewed at the link at flyby. The top two images depict two small sections from the Ohio DSM, and the bottom two images depict two small sections from the California DSM. The DSM depictions have been colored according to the elevation values within the boundaries of each section.
  • Figure 3: Overview of our framework. The three inputs are shown in orange-colored boxes. All outputs produced by the system are shown in green-colored boxes. The modules in blue-colored ellipses operate on a tile-wise basis.
  • Figure 4: Overview of Multi-View Training
  • Figure 5: Two choices for Multi-View Fusion. At top is MV-A in which the weights of the MV Fusion layer are different for each channel of each view. At bottom is MV-B where the weights of the MV Fusion layer are shared by all the channels of a view.
  • ...and 5 more figures
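
The difference between the two fusion choices described in the Figure 5 caption can be illustrated with a small sketch. It assumes, purely for illustration, that fusion is a weighted sum of per-view class-score maps; the caption only states how the weights are shared, so the summation, the function name `fuse_views`, and all numbers below are assumptions rather than the authors' implementation.

```python
def fuse_views(scores, weights):
    """Fuse per-view score maps by a weighted sum over views (an assumption).

    scores[v][c] is a flat list of per-pixel scores for view v, channel c.
    weights[v] is either a list with one weight per channel (MV-A-like:
    a different weight for each channel of each view) or a single float
    (MV-B-like: one weight shared by all channels of that view).
    """
    num_channels = len(scores[0])
    num_pixels = len(scores[0][0])
    fused = [[0.0] * num_pixels for _ in range(num_channels)]
    for v, view_scores in enumerate(scores):
        for c in range(num_channels):
            shared = not isinstance(weights[v], (list, tuple))
            w = weights[v] if shared else weights[v][c]
            for p in range(num_pixels):
                fused[c][p] += w * view_scores[c][p]
    return fused


# Hypothetical scores: 2 views, 2 class channels, 2 pixels.
scores = [
    [[1.0, 0.0], [0.0, 1.0]],  # view 0
    [[0.0, 1.0], [1.0, 0.0]],  # view 1
]

# MV-A-like: one weight per channel per view (2 x 2 weights here).
fused_a = fuse_views(scores, [[0.5, 1.0], [0.5, 0.0]])

# MV-B-like: one weight per view, shared by all of its channels (2 weights).
fused_b = fuse_views(scores, [0.75, 0.25])
```

Under this assumption, the MV-A-like variant needs one weight per view per channel, while the MV-B-like variant needs only one weight per view.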