SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation

Aysim Toker, Marvin Eisenberger, Daniel Cremers, Laura Leal-Taixé

TL;DR

This work is the first to generate both images and corresponding masks for satellite segmentation. It demonstrates that integrating the generated samples yields significant quantitative improvements for satellite semantic segmentation, both over prior baselines and over training on the original data alone.

Abstract

In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation, both compared to baselines and when training only on the original data.
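
To make the idea of a joint image-label sample concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that a pair is packed as RGB channels followed by one-hot label channels and that the mask is recovered by a per-pixel argmax; the abstract does not specify the encoding, so `pack_pair`, `unpack_pair`, and `forward_noise` are hypothetical helpers. The noising step is the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, applied to image and label channels together.

```python
import math
import random


def pack_pair(image, mask, num_classes):
    """Stack an H x W x 3 image with a one-hot encoding of an H x W integer mask.

    Assumption: one-hot label channels; the paper may encode labels differently.
    """
    packed = []
    for img_row, mask_row in zip(image, mask):
        row = []
        for pixel, label in zip(img_row, mask_row):
            one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
            row.append(list(pixel) + one_hot)
        packed.append(row)
    return packed


def unpack_pair(packed, num_image_channels=3):
    """Split a packed sample back into image channels and an argmax label mask."""
    image, mask = [], []
    for row in packed:
        image.append([px[:num_image_channels] for px in row])
        mask.append([
            max(range(len(px) - num_image_channels),
                key=lambda c: px[num_image_channels + c])
            for px in row
        ])
    return image, mask


def forward_noise(packed, alpha_bar, rng):
    """DDPM forward step applied jointly to every image and label channel."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [[[scale * v + sigma * rng.gauss(0.0, 1.0) for v in px]
             for px in row] for row in packed]
```

Because both modalities live in one tensor, a single denoising network can model them jointly, and every generated sample splits into an image and a mask that belong together.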

Paper Structure (33 sections, 5 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: We leverage generative image diffusion to synthesize novel training data instances $(\mathbf{x}',\mathbf{y}')\sim p(\mathbf{x},\mathbf{y})$ for a given labeled earth observation dataset [waqas2019isaid, xia2023openearthmap, wang2021loveda]. In our experiments, we demonstrate that integrating such synthetic pairs as training data for downstream semantic segmentation yields significant quantitative improvements.
  • Figure 2: Approach overview. (a) We train a generative image diffusion model $\mathcal{G}$ on the joint data instances $(\mathbf{x}_{i},\mathbf{y}_{i})\in\mathcal{D}$ of images $\mathbf{x}_{i}$ and corresponding labels $\mathbf{y}_{i}$. We then employ $\mathcal{G}$ to generate a dataset $\mathcal{D}'$ of novel training samples $(\mathbf{x}_{i}',\mathbf{y}_{i}')$. (b) Both the real pairs $\mathcal{D}$ and the generated pairs $\mathcal{D}'$ are integrated and leveraged for the downstream semantic segmentation task. (c) Moreover, we compare the resulting distributions of foreground classes, highlighting that the set of generated labels in $\mathcal{D}'$ closely matches the original distribution $\mathcal{D}$. For a legend of label acronyms, refer to the teaser figure, panel (a).
  • Figure 3: Generated samples, iSAID [waqas2019isaid]. We visualize several pairs $(\mathbf{x}_{i}',\mathbf{y}_{i}')$ sampled from the diffusion model $\mathcal{G}$ detailed in the approach subsection. Color coding for the semantic masks $\mathbf{y}_{i}'$ is indicated by the corresponding palette legend (top right). The generated scenes are of high quality and the semantic layout is coherent; for instance, soccer ball fields are frequently surrounded by ground track fields (bottom, 4th).
  • Figure 4: Generated samples, LoveDA [wang2021loveda]. We display pairs $(\mathbf{x}_{i}',\mathbf{y}_{i}')$ generated by $\mathcal{G}$ on LoveDA. The obtained satellite scenes consist of visually plausible images $\mathbf{x}_{i}'$ and corresponding semantic masks $\mathbf{y}_{i}'$ for general land-cover classes.
  • Figure 5: Analysis of synthetic data. We assess the impact of adding generated samples, i.e., training on $\mathcal{D}\cup\mathcal{D}'$, on the mIoU segmentation score for iSAID [waqas2019isaid], LoveDA [wang2021loveda], and OpenEarthMap [xia2023openearthmap], with a spatial size of $128\times128$. Different resampling ratios are applied, defined as sampling $R\in\{0,\dots,5\}$ synthetic pairs per original instance, i.e., $|\mathcal{D}'|=R\cdot|\mathcal{D}|$ pairs in total. In each case, error bars denote the standard error (SE). For ease of comparison, we separately plot the accuracies without synthetic samples ($R=0$, gray dashed lines).
  • ...and 8 more figures
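
The resampling ratio in the Figure 5 caption can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: `build_training_set` and `standard_error` are hypothetical helpers, drawing synthetic pairs without replacement from a pre-generated pool is an assumption, and the standard error uses the usual sample standard deviation (with $n-1$) divided by $\sqrt{n}$, which may differ from the exact convention used in the paper.

```python
import math
import random


def build_training_set(real_pairs, synthetic_pairs, ratio, seed=0):
    """Return D ∪ D' where |D'| = ratio * |D|.

    Assumption: synthetic pairs are drawn without replacement from a
    pre-generated pool; ratio = 0 reproduces training on real data only.
    """
    needed = ratio * len(real_pairs)
    if needed > len(synthetic_pairs):
        raise ValueError("not enough synthetic pairs for the requested ratio")
    rng = random.Random(seed)
    return list(real_pairs) + rng.sample(list(synthetic_pairs), needed)


def standard_error(scores):
    """Standard error of the mean over repeated runs (requires n >= 2)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)
```

With four real pairs and $R=5$, for example, the combined set holds $4 + 5\cdot 4 = 24$ pairs; `standard_error` would then be applied to the mIoU scores of repeated training runs at each ratio.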