Seeing the roads through the trees: A benchmark for modeling spatial dependencies with aerial imagery

Caleb Robinson, Isaac Corley, Anthony Ortiz, Rahul Dodhia, Juan M. Lavista Ferres, Peyman Najafirad

TL;DR

Robust spatial reasoning over long-range context is essential for road segmentation in aerial imagery when occlusions occur. The authors introduce the Chesapeake Roads Spatial Context (RSC) benchmark, a 30k 512×512 NAIP patch dataset with three labels and a distance-weighted recall metric to probe long-range context use, and benchmark several semantic segmentation models. They find that performance on the occluded 'tree canopy over road' class declines as the distance from existing road context increases, with distant samples achieving only 30–40% recall across models. The work provides code and data to advance research in long-range spatial reasoning for geospatial ML and highlights the need for architectures that capture wider spatial dependencies.

Abstract

Fully understanding a complex high-resolution satellite or aerial imagery scene often requires spatial reasoning over a broad relevant context. The human object recognition system is able to understand objects in a scene over a long-range relevant context. For example, if a human observes an aerial scene that shows sections of road broken up by tree canopy, they are unlikely to conclude that the road has actually been broken into disjoint pieces by trees, and will instead infer that the canopy of nearby trees is occluding the road. However, there is limited research into the long-range context understanding of modern machine learning models. In this work we propose a road segmentation benchmark dataset, Chesapeake Roads Spatial Context (RSC), for evaluating the long-range spatial context understanding of geospatial machine learning models, and we show how commonly used semantic segmentation models can fail at this task. For example, we show that a U-Net trained to segment roads from background in aerial imagery achieves 84% recall on unoccluded roads, but just 63.5% recall on roads covered by tree canopy, despite being trained to model both the same way. We further analyze how model performance changes as the distance to the relevant context for a decision (unoccluded roads in our case) varies. We release the code to reproduce our experiments, along with the dataset of imagery and masks, to encourage future research in this direction: https://github.com/isaaccorley/ChesapeakeRSC.
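
The per-class comparison in the abstract (recall on unoccluded roads versus roads under tree canopy, for a model that only predicts road versus background) can be illustrated with a short sketch. This is not the authors' evaluation code. The integer label encoding (`0` background, `1` road, `2` tree canopy over road) and the function name `per_class_recall` are assumptions made for illustration, so check the released dataset for the actual values.

```python
import numpy as np

# Assumed label encoding (hypothetical; the released masks may differ).
BACKGROUND, ROAD, CANOPY_OVER_ROAD = 0, 1, 2


def per_class_recall(labels, pred_road):
    """Recall of a binary road prediction, split by ground-truth class.

    labels:    integer array with the three assumed classes above.
    pred_road: boolean array of the same shape, True where the model
               predicts "road" (occluded or not).
    """
    labels = np.asarray(labels)
    pred_road = np.asarray(pred_road, dtype=bool)
    recalls = {}
    for name, value in (("road", ROAD), ("canopy_over_road", CANOPY_OVER_ROAD)):
        mask = labels == value
        total = int(mask.sum())
        # Fraction of ground-truth pixels of this class predicted as road.
        recalls[name] = float(pred_road[mask].sum()) / total if total else float("nan")
    return recalls


labels = np.array([[1, 1, 2, 2],
                   [0, 0, 0, 0]])
pred_road = np.array([[1, 1, 1, 0],
                      [0, 0, 0, 0]], dtype=bool)
result = per_class_recall(labels, pred_road)
```

In this toy case every unoccluded road pixel is recovered while only half of the occluded ones are, which is the kind of gap the benchmark is designed to expose.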

Paper Structure (6 sections, 4 figures, 1 table)

This paper contains 6 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Map of the distribution of the 30,000 train, validation, and test patches in the Chesapeake Roads Spatial Context (RSC) dataset.
  • Figure 2: Example images and labels from the proposed Chesapeake Roads Spatial Context (RSC) dataset. Labels are shown over the corresponding NAIP aerial imagery with the "Road" class colored in blue and the "Tree Canopy over Road" class in red.
  • Figure 3: Example predictions (highlighted in blue) from the U-Net ResNet-18 model on held-out test set imagery. Note that the model is only trained to discriminate between the "road" class (which includes both occluded and unoccluded roads) and the "background" class.
  • Figure 4: Recall of each model on the "Tree Canopy over Road" class, shown as a function of distance from the nearest "Road" pixel. Across all models, recall on this class drops as the distance between the classified pixel and the nearest road pixel grows. For example, while the DeepLabv3+ ResNet-18 has $>70\%$ recall on "tree canopy over road" pixels that are adjacent to "road" pixels, its recall drops below $50\%$ for "tree canopy over road" pixels that are more than 15 pixels away from a "road" pixel.
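
The analysis behind Figure 4 (recall on occluded road pixels as a function of distance to the nearest unoccluded road pixel) could be approximated as below. This is a minimal sketch under stated assumptions, not the authors' implementation and not a definition of their distance-weighted recall metric. The label encoding, the function name `canopy_recall_by_distance`, and the bin edges are all hypothetical. It uses SciPy's Euclidean distance transform, which returns each pixel's distance to the nearest zero-valued element.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Assumed label encoding (hypothetical; the released masks may differ).
BACKGROUND, ROAD, CANOPY_OVER_ROAD = 0, 1, 2


def canopy_recall_by_distance(labels, pred_road, bin_edges):
    """Recall on canopy-over-road pixels, binned by distance to the nearest road pixel.

    Assumes `labels` contains at least one ROAD pixel; otherwise the
    distances are not meaningful.
    """
    labels = np.asarray(labels)
    pred_road = np.asarray(pred_road, dtype=bool)
    # Road pixels become zeros, so the transform yields, for every pixel,
    # the Euclidean distance (in pixels) to the nearest road pixel.
    dist = distance_transform_edt(labels != ROAD)
    canopy = labels == CANOPY_OVER_ROAD
    d = dist[canopy]
    hit = pred_road[canopy]
    recalls = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (d >= lo) & (d < hi)  # half-open bin [lo, hi)
        recalls.append(float(hit[in_bin].mean()) if in_bin.any() else float("nan"))
    return recalls


# One row: a road pixel, four occluded road pixels, then background.
labels = np.array([[1, 2, 2, 2, 2, 0]])
pred_road = np.array([[1, 1, 1, 0, 0, 0]], dtype=bool)
recalls = canopy_recall_by_distance(labels, pred_road, bin_edges=[0, 3, 6])
```

Here the occluded pixels within 3 pixels of the visible road are all recovered and the more distant ones are all missed, mirroring the downward trend the figure reports. To reproduce a dataset-level curve, the per-pixel distances and hits would need to be pooled across all test patches before binning.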