SegForestNet: Spatial-Partitioning-Based Aerial Image Segmentation
Daniel Gritzner, Jörn Ostermann
TL;DR
SegForestNet advances aerial image segmentation by enforcing polygon-like predictions through binary space partitioning (BSP) trees, enabling a robust, end-to-end differentiable pipeline. The authors introduce three refinements: a residual decoder architecture with improved gradient flow, a loss specific to region maps that sharpens partitions, and the ability to predict class-specific BSP trees. They also present an analysis showing that the quality of the training process can matter more than domain-specific architectural gains. Evaluations across eight diverse datasets show state-of-the-art or competitive performance, with notable benefits for small rectangular objects like cars, and demonstrate that optimized training can level the playing field with generic segmentation models. The work suggests practical impact in map construction and environmental monitoring, and points to future directions in learning optimal tree types and extending to instance or panoptic segmentation.
Abstract
Aerial image segmentation is the basis for applications such as automatically creating maps or tracking deforestation. In true orthophotos, which are often used in these applications, many objects and regions can be approximated well by polygons. However, this fact is rarely exploited by state-of-the-art semantic segmentation models. Instead, most models allow unnecessary degrees of freedom in their predictions by allowing arbitrary region shapes. We therefore present a refinement of our deep learning model which predicts binary space partitioning (BSP) trees, an efficient polygon representation. The refinements include a new feature decoder architecture and a new differentiable BSP tree renderer, both of which avoid vanishing gradients. Additionally, we designed a novel loss function specifically to improve the spatial partitioning defined by the predicted trees. Furthermore, our expanded model can predict multiple trees at once and thus can predict class-specific segmentations. As an additional contribution, we investigate the impact of a non-optimal training process in comparison to an optimized training process. While model architectures optimized for aerial images, such as PFNet or our own model, show an advantage under non-optimal conditions, this advantage disappears under optimal training conditions. Despite this observation, our model still makes better predictions for small rectangular objects, e.g., cars.
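
To make the idea of rendering a BSP tree differentiably more concrete, the following is a minimal sketch and not the authors' renderer. It assumes a depth-2 tree over a single image block, inner nodes parameterized as lines a*x + b*y + c = 0, and a sigmoid as the soft side-of-line test. The function name `render_bsp_block`, the `sharpness` parameter, and the block size are hypothetical choices made only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def render_bsp_block(inner, leaves, size=8, sharpness=10.0):
    """Softly render a depth-2 BSP tree into per-pixel class probabilities.

    inner:  (3, 3) array; row i holds (a, b, c) of the line a*x + b*y + c = 0
            for the root (i=0), its left child (i=1), and its right child (i=2).
    leaves: (4, C) array of class probabilities, one row per leaf.
    Illustrative assumption only; the paper's renderer may differ.
    """
    coords = np.linspace(-1.0, 1.0, size)
    ys, xs = np.meshgrid(coords, coords, indexing="ij")

    # Soft indicator of "pixel lies on the positive side" per inner node,
    # shape (3, size, size). A sigmoid keeps the test differentiable.
    a = inner[:, 0, None, None]
    b = inner[:, 1, None, None]
    c = inner[:, 2, None, None]
    side = sigmoid(sharpness * (a * xs + b * ys + c))
    root, left, right = side

    # Region map: weight of each leaf per pixel, the product of the soft
    # decisions along the path from the root. Sums to 1 at every pixel.
    region_map = np.stack([
        root * left,
        root * (1.0 - left),
        (1.0 - root) * right,
        (1.0 - root) * (1.0 - right),
    ])

    # Mix the leaf class distributions by region weight: (C, size, size).
    probs = np.einsum("lhw,lc->chw", region_map, leaves)
    return region_map, probs


# Example: a vertical root split, horizontal splits in both children,
# and one distinct class per leaf.
inner = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0]])
leaves = np.eye(4)
region_map, probs = render_bsp_block(inner, leaves)
```

Because every step is a smooth function of the line parameters and leaf distributions, gradients from a pixel-wise loss can flow back to the tree parameters. In this sketch, each leaf covers a convex polygon bounded by at most two lines and the block border, which is how the tree encodes polygon-like regions.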
