SegForestNet: Spatial-Partitioning-Based Aerial Image Segmentation

Daniel Gritzner, Jörn Ostermann

TL;DR

SegForestNet advances aerial image segmentation by enforcing polygon-like predictions through binary space partitioning (BSP) trees while remaining end-to-end differentiable. The authors introduce three refinements: a residual decoder architecture with improved gradient flow, a region-map-specific loss to sharpen partitions, and the ability to predict class-specific BSP trees. They also analyze the training process and find that its quality can matter more than domain-specific architectural gains. Evaluations across eight diverse datasets show state-of-the-art or competitive performance, with notable benefits for small rectangular objects such as cars, and show that an optimized training process lets generic segmentation models catch up with architectures designed for aerial images. The work suggests practical impact in map construction and environmental monitoring and points to future directions in learning optimal tree types and extending to instance or panoptic segmentation.

Abstract

Aerial image segmentation is the basis for applications such as automatically creating maps or tracking deforestation. In true orthophotos, which are often used in these applications, many objects and regions can be approximated well by polygons. However, this fact is rarely exploited by state-of-the-art semantic segmentation models. Instead, most models allow unnecessary degrees of freedom in their predictions by allowing arbitrary region shapes. We therefore present a refinement of our deep learning model which predicts binary space partitioning (BSP) trees, an efficient polygon representation. The refinements include a new feature decoder architecture and a new differentiable BSP tree renderer, both of which avoid vanishing gradients. Additionally, we designed a novel loss function specifically to improve the spatial partitioning defined by the predicted trees. Furthermore, our expanded model can predict multiple trees at once and thus can predict class-specific segmentations. As an additional contribution, we investigate the impact of a non-optimal training process in comparison to an optimized training process. While model architectures optimized for aerial images, such as PFNet or our own model, show an advantage under non-optimal conditions, this advantage disappears under optimal training conditions. Despite this observation, our model still makes better predictions for small rectangular objects, e.g., cars.

Paper Structure (21 sections, 31 equations, 12 figures, 15 tables)

Figures (12)

  • Figure 1: Overview of our approach: our model predicts binary space partitioning (BSP) trees from aerial images. The inner nodes of such a tree define the shape of regions, while the leaf nodes define the content of each region. The leaf nodes effectively map shapes to classes. BSP trees can be rendered in a differentiable way into a full segmentation thus enabling end-to-end model training. Our proposed refinements improve gradient computations in the BSP renderer and parts of the model. Additionally, a novel loss function improves the predicted inner node parameters, i.e., the predicted shapes. Furthermore, we extended the approach to predict multiple trees at the same time in order to enable class-specific shape predictions.
  • Figure 1: The decision boundaries created by signed distance functions based on different geometric primitives. From left to right: line, square, circle, ellipse, hyperbola, parabola. The blue area shows points for which the respective signed distance function is non-negative. As an example, in the inside of the circle $f_3$ is negative whereas it is positive on the outside.
  • Figure 2: A comparison of state-of-the-art models (top) and BSPSegNet/SegForestNet (bottom; ours). All models use an encoder-decoder architecture; however, our models semantically split the feature map (hexagon) into shape and content features. These are decoded into the inner nodes and leaf nodes of BSP trees, respectively (see Fig. 3). Our models also need a differentiable BSP renderer to enable end-to-end training. The renderer is a fixed function without learnable parameters.
  • Figure 2: Visualization of a $k$-d tree (left) and a region it partitions (right). The parameters of each inner node (blue) are a fixed dimension, indicated by the orientation of the line used for partitioning, and a predicted threshold, indicated by the position of the line along the fixed dimension.
  • Figure 3: A partitioning of a square region (right) defined by a BSP tree (center). Shape features are decoded into the parameters of the inner nodes (blue), which define lines (green) creating the partitioning. Content features are decoded into the parameters of the leaf nodes (orange), which are the class logits predicted for each partition.
  • ...and 7 more figures
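
The captions above describe inner nodes as signed distance functions and leaf nodes as class logits. The sketch below illustrates that idea for a tree with a single inner node. It is not the authors' renderer: the sigmoid blending, the `sharpness` parameter, and all function names are assumptions made for illustration, and a plain sigmoid does not reproduce the vanishing-gradient fixes the abstract mentions.

```python
import math

def line_sdf(x, y, nx, ny, d):
    """Signed distance from (x, y) to the line n.p = d; positive on the side n points to."""
    norm = math.hypot(nx, ny)
    return (nx * x + ny * y - d) / norm

def circle_sdf(x, y, cx, cy, r):
    """Negative inside the circle and positive outside, as in the caption."""
    return math.hypot(x - cx, y - cy) - r

def kd_sdf(x, y, dim, threshold):
    """k-d tree node: fixed dimension (0 = x, 1 = y) with a predicted threshold."""
    return (x, y)[dim] - threshold

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def render_single_split(size, inner, pos_leaf, neg_leaf, sharpness=10.0):
    """Softly render a one-split BSP tree into per-pixel class logits.

    inner:    (nx, ny, d) line parameters of the only inner node.
    pos_leaf: class logits for the side where the signed distance is >= 0.
    neg_leaf: class logits for the other side.
    Returns a size x size grid of logit lists (hypothetical layout).
    """
    nx, ny, d = inner
    image = []
    for row in range(size):
        out_row = []
        for col in range(size):
            # Pixel centers mapped into the unit square.
            x = (col + 0.5) / size
            y = (row + 0.5) / size
            # Soft membership of the non-negative side (assumed sigmoid blend).
            w = sigmoid(sharpness * line_sdf(x, y, nx, ny, d))
            out_row.append([w * p + (1.0 - w) * n
                            for p, n in zip(pos_leaf, neg_leaf)])
        image.append(out_row)
    return image

if __name__ == "__main__":
    # Vertical split at x = 0.5: class 1 on the right, class 0 on the left.
    seg = render_single_split(4, (1.0, 0.0, 0.5), [0.0, 5.0], [5.0, 0.0])
    for row in seg:
        print([max(range(2), key=lambda c: px[c]) for px in row])
```

Because every step is a smooth function of the line parameters and the leaf logits, a loss on the rendered logits can in principle be backpropagated to both. A deeper tree would apply the same blend recursively, one inner node per level.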