A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao

TL;DR

The paper tackles boundary misalignment in remote sensing semantic segmentation by combining discriminative coarse predictions with diffusion-based refinement. It introduces IDGBR, a two-stage framework that uses a conditional guidance network and a representation-alignment regularizer to fuse semantic correctness with boundary precision. The refinement stage runs as latent diffusion, and experiments on five datasets show consistent boundary improvements (measured by WF_m) across discriminative architectures and across binary and multi-class tasks. The work demonstrates the practical value of boundary-aware refinement for mapping pipelines and shows that conditional guidance and staged training help keep semantic and boundary representations coherent.

Abstract

Remote sensing semantic segmentation must address both what the ground objects in an image are and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, and overlook its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network, which jointly learns a guidance representation. An iterative denoising diffusion process then uses this representation to refine the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm that our framework consistently refines the boundaries of coarse results from diverse discriminative architectures.
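The split between low-frequency (region-level) and high-frequency (boundary-level) information can be made concrete with a small NumPy sketch. This is only an illustration under stated assumptions, not the paper's analysis: the function name `split_frequencies`, the `cutoff` value, the ideal FFT low-pass filter, and the toy square mask are all choices made here for the example.

```python
import numpy as np

def split_frequencies(mask, cutoff=0.1):
    """Split a 2-D map into low- and high-frequency parts.

    Uses an ideal (hard-threshold) low-pass filter in the Fourier domain.
    `cutoff` is in cycles per pixel and is an illustrative assumption.
    """
    h, w = mask.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_pass = radius <= cutoff       # keep only frequencies near DC
    spectrum = np.fft.fft2(mask)
    low = np.fft.ifft2(spectrum * low_pass).real
    high = mask - low                 # residual carries the fine detail
    return low, high

# Toy binary segmentation mask: one square "object" on background.
mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0
low, high = split_frequencies(mask)

# Band of pixels within 2 px of the object's edge (hypothetical tolerance).
boundary = np.zeros(mask.shape, dtype=bool)
boundary[14:50, 14:50] = True
boundary[18:46, 18:46] = False

# Share of high-frequency energy that falls inside the boundary band,
# compared with the share of the image area that the band occupies.
high_energy = high ** 2
boundary_share = high_energy[boundary].sum() / high_energy.sum()
area_share = boundary.mean()
print(f"boundary band area share:        {area_share:.3f}")
print(f"high-frequency energy in band:   {boundary_share:.3f}")
```

In this toy case the low-pass component keeps the object's overall extent and mean coverage, while the high-pass residual is concentrated around the object's edge, which is the sense in which boundary localization is described as high-frequency information.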

Paper Structure

This paper contains 38 sections, 27 equations, 17 figures, 17 tables, 2 algorithms.

Figures (17)

  • Figure 1: A comparative evaluation between a diffusion model and a discriminative model for semantic segmentation of remote sensing imagery was conducted, with both models fully trained on the Potsdam dataset. The comparison is carried out from two perspectives: (a) visualization from a frequency-based viewpoint, and (b) overall semantic reasoning accuracy. The results reveal complementary characteristics between the two models: the diffusion model exhibits an inherent advantage in generating fine-grained boundaries (high-frequency information), while the discriminative model demonstrates stronger capability in semantic reasoning (low-frequency information).
  • Figure 2: Generative semantic segmentation pipeline using a latent diffusion model.
  • Figure 3: Analysis of the effectiveness of diffusion-based generative learning for boundary segmentation. (a) Qualitative assessment of inference accuracy on high-frequency and low-frequency components. (b) Theoretical analysis of how the diffusion denoising process enhances high-frequency component learning. The horizontal axis represents relative spatial frequency ($f$), where smaller $|f|$ values correspond to low-frequency structures and larger $|f|$ values represent high-frequency details; the vertical axis is the normalized frequency response. The curves illustrate the dynamic evolution of the retention of different frequency components during the reverse denoising process.
  • Figure 4: Unified framework for remote sensing image segmentation merging discriminative models with diffusion generators.
  • Figure 5: Input and output data flow of the conditional guidance network (illustrated with an encoding module and its corresponding conditional guidance module).
  • ...and 12 more figures