Multi-Class Segmentation from Aerial Views using Recursive Noise Diffusion

Benedikt Kolbeinsson, Krystian Mikolajczyk

TL;DR

This work addresses the challenge of multi-class semantic segmentation for aerial imagery by introducing a recursive denoising diffusion framework with hierarchical multi-scale processing. The method defines a forward diffusion on segmentation maps conditioned on RGB input and learns a denoiser that can predict segmentation across arbitrary time steps, enhanced by training with recursive denoising and a multi-scale strategy. It reports strong results on UAVid and state-of-the-art performance on Vaihingen Buildings, illustrating the potential of diffusion-based, multi-class aerial segmentation. The approach offers flexibility in noise functions, diffusion models, and losses, and highlights practical considerations such as inference-time trade-offs and data requirements, paving the way for future improvements and broader applications.

Abstract

Semantic segmentation from aerial views is a crucial task for autonomous drones, as they rely on precise and accurate segmentation to navigate safely and efficiently. However, aerial images present unique challenges such as diverse viewpoints, extreme scale variations, and high scene complexity. In this paper, we propose an end-to-end multi-class semantic segmentation diffusion model that addresses these challenges. We introduce recursive denoising to allow information to propagate through the denoising process, as well as a hierarchical multi-scale approach that complements the diffusion process. Our method achieves promising results on the UAVid dataset and state-of-the-art performance on the Vaihingen Building segmentation benchmark. Being the first iteration of this method, it shows great promise for future improvements.

Paper Structure (30 sections, 16 equations, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: A high-level illustration of the recursive diffusion concept. The diffusion model is conditioned on the input image as well as the previous segmentation prediction at various scales, before returning the final semantic segmentation map.
  • Figure 2: Overview of the recursive noise diffusion process. The noise function diffuses the previous predicted segmentation. The model denoises the diffused segmentation given a conditioning RGB image. Finally, the denoised predicted segmentation is compared to the ground truth. The segmentation is initialized as pure noise. Notably, the ground truth segmentation is never used as part of the input to the model. This process is agnostic to the choice of noise function, diffusion model and loss.
  • Figure 3: The hierarchical multi-scale process. A down-scaled input is first denoised for half the time steps before being up-scaled to the original resolution (bilinear interpolation) for the remaining time steps. Large structures appear first, while finer details appear later.
  • Figure 4: WNetFormer model architecture. Converting UNetFormer to a diffusion model. UNetFormer consists of global-local transformer blocks (GLTB), weighted sums (WS) and a feature refinement head (FRH). The diffused segmentation head consists of down-sampling (bilinear interpolation) and concatenation. The time step is concatenated, channel-wise, to the diffused segmentation.
  • Figure 5: Qualitative results on Vaihingen Buildings [cramer2010dgpf]. Top row: input image; middle row: our method; bottom row: ground truth.
  • ...and 6 more figures
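
The captions for Figures 2 and 3 describe the control flow of recursive denoising and of the hierarchical multi-scale schedule. The following is a minimal NumPy sketch of that control flow only, not the authors' implementation. Everything named here is an assumption made for illustration: `noise_fn` (a linear blend with Gaussian noise), `toy_model` (a hand-written stand-in for the learned denoiser), the mean-squared-error loss, and the convention that time steps count down from `num_steps` to 1. The captions state that the process is agnostic to the choice of noise function, diffusion model and loss, so each of these can be swapped out.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear resize of an (H, W, C) array, align-corners convention."""
    h, w = x.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = (1 - wx) * x[y0][:, x0] + wx * x[y0][:, x1]
    bottom = (1 - wx) * x[y1][:, x0] + wx * x[y1][:, x1]
    return (1 - wy) * top + wy * bottom

def noise_fn(seg, t, num_steps, rng):
    """Hypothetical noise function: more Gaussian noise at larger t."""
    w = t / num_steps
    return (1.0 - w) * seg + w * rng.standard_normal(seg.shape)

def toy_model(diffused, image, t_frac):
    """Hypothetical stand-in for the learned denoiser (not WNetFormer).

    It bins pixel brightness into classes and mixes that evidence with the
    diffused segmentation, trusting the latter more at low noise levels.
    """
    num_classes = diffused.shape[-1]
    brightness = image.mean(axis=-1)
    bins = np.minimum((brightness * num_classes).astype(int), num_classes - 1)
    evidence = np.eye(num_classes)[bins]  # one-hot scores, shape (H, W, C)
    trust = 0.5 * (1.0 - t_frac)
    return (1.0 - trust) * evidence + trust * diffused

def denoise_steps(model, image, seg, steps, num_steps, rng, target=None):
    """Recursive denoising: each step diffuses the previous prediction."""
    losses = []
    for t in steps:
        diffused = noise_fn(seg, t, num_steps, rng)
        # The model sees only the diffused prediction, the image and t;
        # the ground truth is never part of its input.
        seg = model(diffused, image, t / num_steps)
        if target is not None:
            # Placeholder loss; a real training loop would also take an
            # optimizer step here.
            losses.append(float(np.mean((seg - target) ** 2)))
    return seg, losses

def multiscale_denoise(model, image, num_classes, num_steps, rng):
    """Half the steps at half resolution, the rest at full resolution."""
    h, w = image.shape[:2]
    small = bilinear_resize(image, h // 2, w // 2)
    seg = rng.standard_normal((h // 2, w // 2, num_classes))  # pure noise
    half = num_steps // 2
    coarse_steps = range(num_steps, num_steps - half, -1)
    fine_steps = range(num_steps - half, 0, -1)
    seg, _ = denoise_steps(model, small, seg, coarse_steps, num_steps, rng)
    seg = bilinear_resize(seg, h, w)
    seg, _ = denoise_steps(model, image, seg, fine_steps, num_steps, rng)
    return seg.argmax(axis=-1)  # per-pixel class labels
```

Calling `multiscale_denoise(toy_model, image, num_classes, num_steps, rng)` with an `(H, W, 3)` image whose values lie in `[0, 1]` returns an `(H, W)` array of class indices. Passing `target` to `denoise_steps` records one loss value per time step, which mirrors the comparison against the ground truth described in the Figure 2 caption.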