Table of Contents
Fetching ...

SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations

Wendi Liu, Pei Yang, Wenhui Hong, Xiaoguang Mei, Jiayi Ma

TL;DR

This work tackles the scarcity of labeled hyperspectral data for dense prediction tasks by introducing SpecDM, a diffusion-based pipeline that learns a joint image-label latent distribution. It employs a two-stream VAE to encode HSIs and pixel-level annotations into latent codes $(z_x, z_y)$ and trains a latent diffusion model to synthesize realistic image-label pairs, enabling both semantic segmentation and change detection datasets. The approach demonstrates improved downstream performance and spectral fidelity on SegMunich and OSCD compared with baselines, validating the usefulness of synthetic labeled HSIs for data-hungry tasks. A key limitation is the challenge of automatically verifying pixel-level alignment between generated images and annotations without ground truth, which the authors acknowledge for future work.

Abstract

In hyperspectral remote sensing field, some downstream dense prediction tasks, such as semantic segmentation (SS) and change detection (CD), rely on supervised learning to improve model performance and require a large amount of manually annotated data for training. However, due to the needs of specific equipment and special application scenarios, the acquisition and annotation of hyperspectral images (HSIs) are often costly and time-consuming. To this end, our work explores the potential of generative diffusion model in synthesizing HSIs with pixel-level annotations. The main idea is to utilize a two-stream VAE to learn the latent representations of images and corresponding masks respectively, learn their joint distribution during the diffusion model training, and finally obtain the image and mask through their respective decoders. To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations. Our proposed approach can be applied in various kinds of dataset generation. We select two of the most widely used dense prediction tasks: semantic segmentation and change detection, and generate datasets suitable for these tasks. Experiments demonstrate that our synthetic datasets have a positive impact on the improvement of these downstream tasks.

SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations

TL;DR

This work tackles the scarcity of labeled hyperspectral data for dense prediction tasks by introducing SpecDM, a diffusion-based pipeline that learns a joint image-label latent distribution. It employs a two-stream VAE to encode HSIs and pixel-level annotations into latent codes and trains a latent diffusion model to synthesize realistic image-label pairs, enabling both semantic segmentation and change detection datasets. The approach demonstrates improved downstream performance and spectral fidelity on SegMunich and OSCD compared with baselines, validating the usefulness of synthetic labeled HSIs for data-hungry tasks. A key limitation is the challenge of automatically verifying pixel-level alignment between generated images and annotations without ground truth, which the authors acknowledge for future work.

Abstract

In hyperspectral remote sensing field, some downstream dense prediction tasks, such as semantic segmentation (SS) and change detection (CD), rely on supervised learning to improve model performance and require a large amount of manually annotated data for training. However, due to the needs of specific equipment and special application scenarios, the acquisition and annotation of hyperspectral images (HSIs) are often costly and time-consuming. To this end, our work explores the potential of generative diffusion model in synthesizing HSIs with pixel-level annotations. The main idea is to utilize a two-stream VAE to learn the latent representations of images and corresponding masks respectively, learn their joint distribution during the diffusion model training, and finally obtain the image and mask through their respective decoders. To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations. Our proposed approach can be applied in various kinds of dataset generation. We select two of the most widely used dense prediction tasks: semantic segmentation and change detection, and generate datasets suitable for these tasks. Experiments demonstrate that our synthetic datasets have a positive impact on the improvement of these downstream tasks.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of our approach. (a) In the training stage, we design a two-stream VAE to compress HSIs and corresponding masks from pixel space to latent space, and then train a denoising U-Net on the joint representations. The latent representation is split to feed forward to corresponding decoders to complete the reconstruction. (b) In the inference stage, after training the generator $\mathcal{G}$, we start from the noised sample $z_T$ and obtain the synthetic image-mask pairs through decoders, to augment the original real dataset when training the downstream task models.
  • Figure 2: Generated samples-SegMunich. We visualize several pairs of HSIs (shown in false-color) and corresponding segmentation maps generated by the baseline method LDM and our method respectively, comparing to the real samples.
  • Figure 3: Spectral profile comparison. We visualize the spectral response of our generated samples, comparing to real samples. We sample the pixels of several typical landforms according to the annotations. The intensity of spectral responses of the same landform keep consistent in different HSIs and are close to the real samples.
  • Figure 4: Distribution of landform classes, illustrating that the set of our generated samples closely matches the real distribution. The proportion of several main classes are very close (Arable land, Pastures and Forests).
  • Figure 5: Generated samples-OSCD. We visualize several samples generated by our method and highlight the changed regions. The masks can annotate these changes in high accuracy.